ML Engineer Path¶
Duration: ~130 minutes | Format: Python Pipeline | Prerequisites: Python, DE Path Chapter 1 completed
Build production feature stores and ML models using Python pipeline decorators (@source, @feature_group, @transform) and Seeknal's declarative YAML SOA engine.
What You'll Learn¶
The ML Engineer path teaches you to build production-grade feature stores and ML models with Seeknal's Python pipeline API. You'll learn to:
- Build Feature Stores — Create feature groups with @feature_group, evolve schemas iteratively
- Second-Order Aggregations — Generate hierarchical features with the YAML SOA engine (basic, window, ratio)
- Point-in-Time Joins & Training-Serving Parity — Build PIT-correct training data with FeatureFrame.pit_join(), temporal SOA features, and online serving
- Entity Consolidation — Merge feature groups into per-entity views with struct columns
- End-to-End ML with MLflow — Train propensity models, track experiments, and run batch predictions
Prerequisites¶
Before starting this path, ensure you have:
- DE Path Chapter 1 completed — You'll use the e-commerce data
- Python 3.11+ and uv installed (curl -LsSf https://astral.sh/uv/install.sh | sh)
- Understanding of ML features and training data
Chapters¶
Chapter 1: Build a Feature Store (~30 minutes)¶
Create feature groups using Python decorators.
You'll build:
- Python sources with @source decorator and PEP 723 dependencies
- Feature groups with @feature_group and ctx.ref()
- Schema evolution workflow for iterating on features
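Python sources declare their dependencies inline with PEP 723 script metadata, which is what lets uv resolve them at run time. Below is a minimal sketch of that pattern; the function body and names are illustrative assumptions, and it is written as a plain function (without the @source decorator) so it runs standalone rather than reproducing the exact template `seeknal draft` generates:

```python
# /// script
# requires-python = ">=3.11"
# dependencies = ["pandas"]
# ///
import pandas as pd

# In a real Seeknal draft this function would carry the @source decorator;
# here it is a plain function so the sketch can run on its own.
def transactions() -> pd.DataFrame:
    """Return raw transaction rows as a DataFrame."""
    return pd.DataFrame({
        "customer_id": [1, 1, 2],
        "amount": [10.0, 25.0, 7.5],
        "application_date": pd.to_datetime(
            ["2024-01-01", "2024-01-02", "2024-01-02"]
        ),
    })

df = transactions()
```

The `# /// script ... # ///` block is standard PEP 723 metadata, so the same file works with `uv run` outside Seeknal as well.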
Chapter 2: Second-Order Aggregations (~30 minutes)¶
Generate hierarchical features from raw transactions:
```
source.transactions (Ch.1) → feature_group.customer_daily_agg → second_order_aggregation.region_metrics
                             (Python @feature_group)            (YAML SOA engine)
                             ├── SUM, COUNT per day             ├── basic: sum, mean, max, stddev
                             └── application_date               ├── window: recent 7-day totals
                                                                └── ratio: recent vs past spending
```
You'll build:
- Feature groups with @feature_group and ctx.duckdb.sql()
- YAML SOA with declarative features: spec (basic, window, ratio)
- Time-window features using application_date_col
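To make the three SOA feature families concrete, here is a hedged pandas sketch of the kind of values the YAML engine computes from daily aggregates. Column names, the 7-day window, and the cutoff date are illustrative assumptions; the real engine is driven by the declarative `features:` spec, not by code like this:

```python
import pandas as pd

# Daily per-customer aggregates, as produced by a feature group
daily = pd.DataFrame({
    "customer_id": [1, 1, 1, 1],
    "application_date": pd.to_datetime(
        ["2024-01-01", "2024-01-05", "2024-01-08", "2024-01-10"]
    ),
    "daily_spend": [10.0, 20.0, 30.0, 40.0],
})

cutoff = pd.Timestamp("2024-01-10")
window_start = cutoff - pd.Timedelta(days=7)
recent = daily[daily["application_date"] > window_start]
past = daily[daily["application_date"] <= window_start]

features = {
    # basic: statistics over the whole history
    "spend_sum": daily["daily_spend"].sum(),
    "spend_mean": daily["daily_spend"].mean(),
    # window: total over the recent 7 days relative to the cutoff
    "spend_sum_7d": recent["daily_spend"].sum(),
    # ratio: recent vs past spending
    "spend_ratio_7d": recent["daily_spend"].sum() / past["daily_spend"].sum(),
}
```

A ratio well above 1 flags customers whose recent spending outpaces their history, which is the kind of trend signal the ratio family is for.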
Chapter 3: Point-in-Time Joins & Training-Serving Parity (~35 minutes)¶
Build a production ML pipeline with temporal correctness:
```
source.churn_labels (spine with application_date)
        ↓
@transform: pit_training_data
    PIT-joins customer_daily_agg via FeatureFrame.pit_join()
        ↓
SOA: customer_training_features (per-customer temporal features)
        ↓
@transform: churn_model (scikit-learn)
        ↓
REPL: Online serving demo (ctx.features())
```
You'll learn:
- Point-in-time joins with FeatureFrame.pit_join() to prevent data leakage
- Per-customer SOA temporal features (spending trends, recency)
- Training scikit-learn models inside @transform nodes
- Online serving with ctx.features() for training-serving parity
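The core idea behind a point-in-time join can be illustrated with `pandas.merge_asof`: for each label row, take the latest feature snapshot at or before that label's date, so no future information leaks into training. This is a conceptual sketch with made-up data, not Seeknal's `FeatureFrame.pit_join()` implementation:

```python
import pandas as pd

# Label spine: one row per (customer, application_date)
labels = pd.DataFrame({
    "customer_id": [1, 1],
    "application_date": pd.to_datetime(["2024-01-05", "2024-01-20"]),
    "churned": [0, 1],
})

# Feature snapshots computed at different points in time
snapshots = pd.DataFrame({
    "customer_id": [1, 1, 1],
    "application_date": pd.to_datetime(
        ["2024-01-01", "2024-01-10", "2024-01-15"]
    ),
    "total_spend": [10.0, 50.0, 90.0],
})

# For each label row, pick the most recent snapshot <= its date;
# later snapshots (the future, from that label's viewpoint) are ignored.
training = pd.merge_asof(
    labels.sort_values("application_date"),
    snapshots.sort_values("application_date"),
    on="application_date",
    by="customer_id",
)
```

The label dated 2024-01-05 gets the 2024-01-01 snapshot (total_spend 10.0), not the later ones, which is exactly the leakage a naive join on customer_id would introduce.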
Chapter 4: Entity Consolidation (~15 minutes)¶
Consolidate multiple feature groups into unified entity views:
```
feature_group.customer_features ──┐
                                  ├──→ Entity Consolidation ──→ entity_customer
feature_group.product_preferences ┘        (automatic)               ↓
                                                              REPL Exploration
                                                                     ↓
                                                           seeknal entity list/show
```
You'll build:
- A second feature group (product_preferences) for the customer entity
- Automatic consolidation with struct-namespaced columns
- CLI commands to inspect consolidated entities
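Conceptually, consolidation merges several feature groups keyed on the same entity into one wide view, with each group's columns namespaced by their group. Seeknal does this automatically with struct columns; this pandas sketch approximates the idea with dotted column prefixes on assumed feature names:

```python
import pandas as pd

customer_features = pd.DataFrame({
    "customer_id": [1, 2],
    "total_spend": [100.0, 40.0],
})
product_preferences = pd.DataFrame({
    "customer_id": [1, 2],
    "favorite_category": ["books", "toys"],
})

# Namespace each feature group's columns, then join on the entity key.
entity_customer = customer_features.rename(
    columns={"total_spend": "customer_features.total_spend"}
).merge(
    product_preferences.rename(
        columns={"favorite_category": "product_preferences.favorite_category"}
    ),
    on="customer_id",
)
```

The namespacing is what keeps same-named features from different groups from colliding once they live in one entity view.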
Chapter 5: End-to-End ML with MLflow (~20 minutes)¶
Build a complete ML workflow with experiment tracking and batch predictions:
```
transform.training_dataset (Ch4) ──→ transform.train_propensity (scikit-learn + MLflow)
                                                 ↓
                                       mlruns/ (model artifact)
                                                 ↓
second_order_aggregation.customer_training_features ─┐
                                                     ├──→ transform.score_customers
feature_group.product_preferences ───────────────────┘          ↓
                                                      REPL: Propensity ranking
```
You'll build:
- A training pipeline that logs experiments to MLflow (params, metrics, model artifact)
- A prediction pipeline that loads the trained model and scores all customers
- Propensity scores with ranks and segments queryable in the REPL
- Separation of training and inference — the production pattern
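The training/inference split can be sketched with scikit-learn alone. The synthetic features, model choice, and ranking logic here are illustrative assumptions, and the MLflow logging/loading steps are elided to comments so the sketch stays self-contained:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# --- Training pipeline: fit a propensity model on historical features ---
X_train = rng.normal(size=(200, 3))  # e.g. spend, recency, spend ratio
y_train = (X_train[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
model = LogisticRegression().fit(X_train, y_train)
# (in the tutorial, params/metrics and the fitted model are logged to MLflow here)

# --- Prediction pipeline: load the model and score all customers ---
# (in the tutorial, the model is loaded back from the mlruns/ artifact)
X_all = rng.normal(size=(5, 3))
scores = model.predict_proba(X_all)[:, 1]      # propensity per customer
ranks = scores.argsort()[::-1].argsort() + 1   # rank 1 = highest propensity
```

Keeping the two halves in separate pipeline nodes is what lets you re-run scoring on fresh data without retraining.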
What You'll Build¶
By the end of this path, you'll have a complete ML pipeline:
| Component | Decorator / Tool | Purpose |
|---|---|---|
| Sources | @source | Declare data ingestion (CSV, Parquet, DB) |
| Feature Groups | @feature_group | Compute and version ML features |
| Transforms | @transform | Data prep, PIT joins, model training |
| SOA | YAML features: spec | Hierarchical meta-features (basic, window, ratio) |
| PIT Joins | FeatureFrame.pit_join() | Temporally correct training data |
| Online Serving | ctx.features() | Training-serving parity for inference |
| Entity Consolidation | CLI + REPL | Unified per-entity views with struct columns |
| Experiment Tracking | MLflow | Log parameters, metrics, and model artifacts |
| Batch Predictions | @transform + MLflow | Load trained models, score customers |
Key Commands You'll Learn¶
```shell
# Python pipeline templates
seeknal draft source <name> --python --deps pandas
seeknal draft feature-group <name> --python --deps pandas,duckdb
seeknal draft transform <name> --python --deps pandas,scikit-learn
seeknal draft second-order-aggregation <name>

# Preview and apply
seeknal dry-run <draft_file>.py
seeknal apply <draft_file>.py
seeknal apply <draft_file>.yml

# Pipeline execution
seeknal plan
seeknal run

# Feature management
seeknal validate-features <fg_name> --mode fail
seeknal lineage <node> --ascii

# Entity consolidation
seeknal entity list
seeknal entity show <entity_name>
seeknal consolidate

# ML experiment tracking
mlflow ui --backend-store-uri file:./mlruns

# Interactive verification
seeknal repl
```
Resources¶
Reference¶
- Python Pipelines Guide — Full decorator reference and patterns
- Entity Consolidation Guide — Cross-FG retrieval and materialization
- CLI Reference — All commands and flags
- YAML Schema Reference — Feature group and SOA schemas
Related Paths¶
- Data Engineer Path — ELT pipelines (prerequisite)
- Analytics Engineer Path — Semantic layers and metrics
- Advanced Guide: Python Pipelines — Mixed YAML + Python
Last updated: February 2026 | Seeknal Documentation