Advanced Guide¶
Duration: ~259 minutes | Difficulty: Intermediate | Format: YAML, Python & CLI
Go deeper with Seeknal's advanced capabilities: multi-format file sources, data quality rules, pipeline lineage visualization, named references, shared configuration, Python pipelines, database/external source connections with incremental detection, Iceberg incremental processing, custom sources, and pipeline tags.
What You'll Learn¶
Take your Seeknal skills to the next level with advanced features that improve pipeline quality, maintainability, and observability:
- File Sources - Load CSV, Parquet, and JSONL data into your pipeline
- Transformations - Clean, join, and aggregate data with SQL
- Data Rules - Validate data quality with automated checks
- Lineage & Inspection - Visualize data flow and debug pipeline outputs
- Named ref() References - Self-documenting, reorder-safe SQL references
- Common Configuration - Shared column mappings, rules, and SQL snippets
- Data Profiling - Compute statistics and validate with threshold checks
- Python Pipelines - Build nodes with Python decorators and mix with YAML
- Database & External Sources - Connect to PostgreSQL, StarRocks, and Iceberg with incremental detection
- Iceberg Incremental Processing - Snapshot detection, watermark tracking, and selective cascade
- Custom Sources - Bring data from REST APIs, cloud storage, and any Python-accessible system
- Pipeline Tags - Organize nodes with tags and run filtered subsets of your pipeline
Prerequisites¶
Before starting, ensure you have:
- At least one Learning Path completed (Data Engineer, Analytics Engineer, or ML Engineer)
- Familiarity with the draft → dry-run → apply workflow
- Seeknal installed and available on your PATH
- Basic SQL knowledge (SELECT, WHERE, JOIN, GROUP BY)
Chapters¶
Chapter 1: File Sources (~20 minutes)¶
Load data from CSV, JSONL, and Parquet files:
```
products.csv           → source.products
sales_events.jsonl     → source.sales_events
sales_snapshot.parquet → source.sales_snapshot
```
You'll learn:
- Creating sources from different file formats
- The draft → dry-run → apply workflow for each
- Exploring data with seeknal repl
- How file formats differ in schema handling
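As a preview, a file source definition is a small YAML document. The sketch below is illustrative only: the exact field names are assumptions, so verify them against the template that `seeknal draft source products` generates.

```yaml
# draft_source_products.yml — sketch; confirm field names against
# the generated draft template before applying.
kind: source
name: products
format: csv
path: data/products.csv
```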
Chapter 2: Transformations (~20 minutes)¶
Clean, join, and aggregate your source data:
```
source.products ────────────┐
                            ├──→ sales_enriched (JOIN)
source.sales_events ────────┘        │
                                     └──→ sales_summary (aggregation)
```
You'll learn:
- Single-input transforms with ref() syntax
- Multi-input transforms (JOINs)
- Aggregation transforms (GROUP BY)
- Running a full pipeline with seeknal plan and seeknal run
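A multi-input JOIN transform can be sketched as YAML wrapping SQL. This is a hedged example: the `inputs`/`sql` field names and the positional `input_0`/`input_1` aliases follow this guide's outline, but the precise schema should be checked against a generated draft.

```yaml
# draft_transform_sales_enriched.yml — sketch; field names are assumptions.
kind: transform
name: sales_enriched
inputs:
  - source.sales_events    # becomes input_0
  - source.products        # becomes input_1
sql: |
  SELECT e.*, p.product_name, p.price
  FROM input_0 e
  JOIN input_1 p ON e.product_id = p.product_id
```

Chapter 5 replaces the positional `input_0`/`input_1` aliases with named `ref()` references.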
Chapter 3: Data Rules (~25 minutes)¶
Validate data quality with automated rule checks:
```
transform.events_cleaned ──→ rule.not_null_quantity   (null check)
                         ──→ rule.positive_quantity   (range check)
source.products          ──→ rule.valid_prices        (range on source)
transform.events_cleaned ──→ rule.no_duplicate_events (sql_assertion)
```
You'll learn:
- Creating rule nodes for data validation
- Expression-based rules (null, range, freshness)
- SQL assertion rules (dbt-style custom SQL checks)
- Severity levels: error vs warn
- Integrating rules into your pipeline DAG
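A rule node might look like the following. Treat this as a loose sketch: the `check`/`range` structure is an assumption based on this outline, while the `severity: error` vs `warn` distinction comes directly from the chapter summary.

```yaml
# draft_rule_positive_quantity.yml — sketch; verify structure against
# the template from `seeknal draft rule positive_quantity`.
kind: rule
name: positive_quantity
input: transform.events_cleaned
severity: error        # `warn` reports violations without failing the run
check:
  column: quantity
  range: [1, null]     # quantity must be >= 1, no upper bound
```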
Chapter 4: Lineage & Inspection (~17 minutes)¶
Visualize data lineage and inspect intermediate pipeline outputs:
```
seeknal lineage                            → Full DAG (HTML)
seeknal lineage transform.sales_enriched   → Focused node view
seeknal lineage transform.X --column total → Column-level trace
seeknal lineage --ascii                    → ASCII tree to stdout
seeknal inspect transform.sales_enriched   → Data preview
```
You'll learn:
- Interactive HTML lineage visualization with Cytoscape.js
- Focused node and column-level lineage tracing
- ASCII tree output for terminal use and AI agent consumption
- Inspecting intermediate node outputs for debugging
- Schema inspection for column types
Chapter 5: Named ref() References (~15 minutes)¶
Refactor transforms to use self-documenting named references:
```
Before: SELECT * FROM input_0 s JOIN input_1 p ON ...
After:  SELECT * FROM ref('source.products') p JOIN ref('transform.events_cleaned') e ON ...
```
You'll learn:
- Named ref() syntax instead of positional input_0
- Self-documenting SQL that's safe to reorder
- Mixed syntax and error handling
- Security validation for ref() arguments
Chapter 6: Common Configuration (~20 minutes)¶
Centralize column mappings, business rules, and SQL snippets:
```
seeknal/common/
├── sources.yml          → {{products.idCol}}, {{products.priceCol}}
├── rules.yml            → {{rules.validPrice}}, {{rules.hasQuantity}}
└── transformations.yml  → {{transforms.priceCalc}}
```
You'll learn:
- Source column mappings with {{ dotted.key }} syntax
- Reusable SQL filter expressions and snippets
- Resolution priority (context > env > common config)
- Typo detection with suggestions
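The shared mapping file might look like this. The `{{ dotted.key }}` placeholder syntax is taken from the chapter summary above; the specific keys and values are illustrative assumptions.

```yaml
# seeknal/common/sources.yml — sketch; keys and values are examples.
products:
  idCol: product_id
  priceCol: unit_price
```

A transform could then write `SELECT {{products.idCol}}, {{products.priceCol}} FROM ...` and pick up the mapping at resolution time, with `--params` overrides taking priority per the resolution order above.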
Chapter 7: Data Profiling & Validation (~20 minutes)¶
Compute statistical profiles and validate with threshold-based quality checks:
```
source.products ──→ profile.products_stats ──→ rule.products_quality
                          │
                          └── row_count, avg, stddev, null_percent,
                              distinct_count, top_values, freshness
```
You'll learn:
- Computing 14+ metrics per column with kind: profile
- Auto-detection of column types (numeric, timestamp, string)
- Threshold-based quality checks with type: profile_check rules
- Soda-style expressions: "> 5", "= 0", "between 10 and 500"
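A threshold check over a profile might be expressed like this. The Soda-style expression strings (`"> 5"`, `"= 0"`, `"between 10 and 500"`) come straight from the chapter summary; the surrounding YAML structure is an assumption to be checked against a generated draft.

```yaml
# draft_rule_products_quality.yml — sketch; layout is assumed.
kind: rule
name: products_quality
type: profile_check
input: profile.products_stats
checks:
  row_count: "> 5"
  price.null_percent: "= 0"
  price.avg: "between 10 and 500"
```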
Chapter 8: Python Pipelines (~25 minutes)¶
Build pipeline nodes using Python decorators and mix them with existing YAML nodes:
```
source.products (YAML) ───────────┐
                                  ├──→ transform.customer_analytics (Python)
transform.sales_enriched (YAML) ──┘

source.exchange_rates (Python) ────→ transform.category_insights (Python)
```
You'll learn:
- Creating Python sources and transforms with @source and @transform
- PEP 723 per-file dependency isolation
- Referencing YAML nodes from Python via ctx.ref()
- Running mixed YAML + Python pipelines
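The decorator pattern can be sketched as follows. Because Seeknal's actual import path and decorator signature aren't shown in this overview, the snippet defines a stand-in `@transform` so it runs on its own; treat the shape, not the names, as the takeaway, and use the real `seeknal` decorators (plus `ctx.ref()` for upstream data) in an actual node.

```python
# Stand-in for Seeknal's @transform decorator — illustrative only.
# The real decorator comes from the seeknal package; this stub just
# records the node name so the sketch is self-contained.
def transform(name):
    def wrap(fn):
        fn.node_name = name
        return fn
    return wrap

@transform(name="transform.customer_analytics")
def customer_analytics(rows):
    # In a real node you would fetch upstream data via
    # ctx.ref("transform.sales_enriched"); here rows are passed directly.
    totals = {}
    for r in rows:
        totals[r["customer"]] = totals.get(r["customer"], 0) + r["amount"]
    return totals
```

Per-file dependencies are declared with a PEP 723 `# /// script` comment block at the top of the file, which is what gives each Python node its isolated environment.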
Chapter 9: Database & External Sources (~32 minutes)¶
Connect to PostgreSQL, StarRocks (MySQL), and Iceberg lakehouse tables:
```
PostgreSQL → source.pg_customers     (table scan)
           → source.events           (incremental detection)
           → source.pg_active_orders (pushdown query)
StarRocks  → source.sr_daily_metrics (MySQL protocol)
Iceberg    → source.ice_events       (REST catalog)
```
You'll learn:
- Connection profiles with env var interpolation (profiles.yml)
- PostgreSQL table scan, pushdown query, and incremental detection sources
- Automatic watermark tracking and WHERE clause injection for incremental reads
- Skip optimization for unchanged sources and --full refresh override
- StarRocks sources via MySQL protocol (pymysql)
- Iceberg sources via Lakekeeper REST catalog with OAuth2
- source_defaults for per-type default connections
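A connection profile might be structured like this. The env var interpolation and `source_defaults` concepts come from the list above; the key names and interpolation syntax shown here are assumptions, so confirm them against the profiles.yml documentation.

```yaml
# profiles.yml — sketch; key names and interpolation syntax are assumed.
connections:
  pg_main:
    type: postgres
    host: "${PG_HOST}"
    user: "${PG_USER}"
    password: "${PG_PASSWORD}"   # never commit real credentials
source_defaults:
  postgres: pg_main              # default connection for postgres sources
```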
Chapter 10: Iceberg Incremental Processing (~30 minutes)¶
Detect Iceberg data changes, load only new rows, and cascade selectively:
```
Iceberg (events) ──→ transform.event_summary ──→ transform.enriched_events
        ▲                                                  ▲
 watermark tracked                                 selective cascade
 in run_state.json                             (only changed branches)

CSV (categories) ────────────────────────────────────────┘
```
You'll learn:
- Snapshot-based change detection (automatic caching)
- Partition-pruned incremental reads with freshness.time_column
- Watermark tracking and NULL-safe filters
- Mixed-source cascade (Iceberg + CSV)
- Full refresh override with --full
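The watermark mechanics generalize beyond Seeknal. Below is a sketch of the kind of NULL-safe incremental filter this chapter describes; the function name and SQL shape are illustrative, not Seeknal's internals.

```python
def incremental_filter(time_column, watermark):
    """Build a WHERE clause that reads only rows newer than the last
    recorded watermark. NULL timestamps are kept so rows without an
    event time are never silently dropped by the incremental read."""
    if watermark is None:
        # First run: no watermark yet, so read everything (full refresh).
        return None
    return f"({time_column} > TIMESTAMP '{watermark}' OR {time_column} IS NULL)"
```

On each successful run the newest observed timestamp would be persisted (the diagram above shows it tracked in run_state.json), becoming the watermark for the next incremental read; `--full` simply ignores it.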
Chapter 11: Custom Sources (~20 minutes)¶
Bring data from REST APIs, cloud storage, and any Python-accessible system:
```
REST API (Open-Meteo)  → transform.api_weather_data
S3/MinIO (boto3)       → transform.s3_inventory_data
Faker (synthetic data) → transform.generated_synthetic_data
                                    ↓
                  transform.enriched_report (joins all three)
```
You'll learn:
- When to use @transform vs @source for data ingestion
- REST API sources with retry and error handling
- Cloud storage sources using boto3
- Synthetic data generation for testing
- Best practices: timeouts, credentials, idempotency
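The retry pattern for REST API sources can be sketched generically. This is a minimal stdlib-only helper, not Seeknal code: `fetch` stands for any zero-argument callable (for example, a wrapper around a `requests.get` call with an explicit timeout).

```python
import time

def fetch_with_retry(fetch, retries=3, backoff=0.01):
    """Call a zero-argument fetch function, retrying transient failures
    with exponential backoff; re-raise once retries are exhausted."""
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the error to the pipeline
            time.sleep(backoff * (2 ** attempt))
```

Keeping the fetch idempotent (and setting a timeout inside it) means a retried call never duplicates side effects, which is one of the best practices this chapter covers.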
Chapter 12: Pipeline Tags (~15 minutes)¶
Organize nodes with tags and run, plan, or visualize filtered subsets:
```
seeknal run --tags churn_pipeline         → Run tagged nodes + upstream deps
seeknal plan --tags revenue_pipeline      → Filtered execution plan
seeknal lineage --tags ml --ascii         → ASCII tree with [tag] annotations
seeknal run --tags ml --exclude-tags exp  → Include then exclude
```
You'll learn:
- Adding tags to YAML nodes and Python decorators
- Running filtered subsets with --tags
- Filter composition rules (--tags + --exclude-tags + --nodes)
- Filtered plan and lineage visualization with tag annotations
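Tagging a YAML node is likely as simple as adding a list field; the sketch below assumes a top-level `tags` key, which should be confirmed against the chapter's examples.

```yaml
# Sketch of a tagged transform node — the `tags` key placement is assumed.
kind: transform
name: churn_features
tags:
  - churn_pipeline
  - ml
```

Running `seeknal run --tags churn_pipeline` would then select this node plus its upstream dependencies.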
Continue Learning¶
Explore other persona paths or dive into the reference documentation:
| Path | Focus | Time |
|---|---|---|
| Data Engineer → | ELT pipelines, incremental processing, production environments | ~75 min |
| Analytics Engineer → | Semantic models, business metrics, BI deployment | ~75 min |
| ML Engineer → | Feature stores, aggregations, entity consolidation | ~115 min |
Key Commands You'll Learn¶
```bash
# Initialize a project
seeknal init --name my-project

# Draft, validate, and apply nodes
seeknal draft source my_source
seeknal draft transform my_transform
seeknal draft rule my_rule
seeknal draft profile my_profile
seeknal draft source my_source --python        # Python source template
seeknal draft transform my_transform --python  # Python transform template
seeknal dry-run draft_source_my_source.yml
seeknal apply draft_source_my_source.yml

# Build and run pipeline
seeknal plan
seeknal run

# Explore data interactively
seeknal repl

# Visualize data lineage
seeknal lineage
seeknal lineage transform.my_transform --column my_col
seeknal lineage --ascii                          # ASCII tree to stdout
seeknal lineage transform.my_transform --ascii   # Focused ASCII tree
seeknal lineage --tags revenue_pipeline --ascii  # Tag-filtered ASCII tree

# Inspect intermediate outputs
seeknal inspect transform.my_transform

# Preview resolved SQL (ref() and {{ }} expressions)
seeknal dry-run seeknal/transforms/my_transform.yml

# Run filtered by tags
seeknal run --tags churn_pipeline
seeknal run --tags ml --exclude-tags experimental

# Override common config at runtime
seeknal run --params events.quantityCol=units_sold

# Run with connection profile
seeknal run --profile profiles.yml
seeknal dry-run draft_source_pg.yml --profile profiles.yml
```
Last updated: March 2026 | Seeknal Documentation