ML Engineer Path¶
Duration: ~130 minutes | Format: Python Pipeline | Prerequisites: Python, DE Path Chapter 1 completed
Build production feature stores and ML models using Python pipeline decorators (@source, @feature_group, @transform) and Seeknal's declarative YAML SOA engine.
What You'll Learn¶
The ML Engineer path teaches you to build production-grade feature stores and ML models with Seeknal's Python pipeline API. You'll learn to:
- Build Feature Stores — Create feature groups with @feature_group, evolve schemas iteratively
- Second-Order Aggregations — Generate hierarchical features with the YAML SOA engine (basic, window, ratio)
- Point-in-Time Joins & Training-Serving Parity — Build PIT-correct training data with FeatureFrame.pit_join(), temporal SOA features, and online serving
- Entity Consolidation — Merge feature groups into per-entity views with struct columns
- End-to-End ML with MLflow — Train propensity models, track experiments, and run batch predictions
Prerequisites¶
Before starting this path, ensure you have:
- DE Path Chapter 1 completed — You'll use the e-commerce data
- Python 3.11+ and uv installed (curl -LsSf https://astral.sh/uv/install.sh | sh)
- Understanding of ML features and training data
Chapters¶
Chapter 1: Build a Feature Store (~30 minutes)¶
Create feature groups using Python decorators.
You'll build:
- Python sources with @source decorator and PEP 723 dependencies
- Feature groups with @feature_group and ctx.ref()
- Schema evolution workflow for iterating on features
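Python sources declare their dependencies inline with PEP 723 script metadata, which is what lets uv resolve them at run time. Below is a minimal sketch of that pattern; the function body and names are illustrative assumptions, and it is written as a plain function (without the @source decorator) so it runs standalone rather than reproducing the exact template `seeknal draft` generates:

```python
# /// script
# requires-python = ">=3.11"
# dependencies = ["pandas"]
# ///
import pandas as pd

# In a real Seeknal draft this function would carry the @source decorator;
# here it is a plain function so the sketch can run on its own.
def transactions() -> pd.DataFrame:
    """Return raw transaction rows as a DataFrame."""
    return pd.DataFrame({
        "customer_id": [1, 1, 2],
        "amount": [10.0, 25.0, 7.5],
        "application_date": pd.to_datetime(
            ["2024-01-01", "2024-01-02", "2024-01-02"]
        ),
    })

df = transactions()
```

The `# /// script ... # ///` block is standard PEP 723 metadata, so the same file works with `uv run` outside Seeknal as well.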
Chapter 2: Second-Order Aggregations (~30 minutes)¶
Generate hierarchical features from raw transactions:
```
source.transactions (Ch.1) → feature_group.customer_daily_agg → second_order_aggregation.region_metrics
                             (Python @feature_group)            (YAML SOA engine)
                             ├── SUM, COUNT per day             ├── basic: sum, mean, max, stddev
                             └── application_date               ├── window: recent 7-day totals
                                                                └── ratio: recent vs past spending
```
You'll build:
- Feature groups with @feature_group and ctx.duckdb.sql()
- YAML SOA with declarative features: spec (basic, window, ratio)
- Time-window features using application_date_col
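To make the three SOA feature families concrete, here is a hedged pandas sketch of the kind of values the YAML engine computes from daily aggregates. Column names, the 7-day window, and the cutoff date are illustrative assumptions; the real engine is driven by the declarative `features:` spec, not by code like this:

```python
import pandas as pd

# Daily per-customer aggregates, as produced by a feature group
daily = pd.DataFrame({
    "customer_id": [1, 1, 1, 1],
    "application_date": pd.to_datetime(
        ["2024-01-01", "2024-01-05", "2024-01-08", "2024-01-10"]
    ),
    "daily_spend": [10.0, 20.0, 30.0, 40.0],
})

cutoff = pd.Timestamp("2024-01-10")
window_start = cutoff - pd.Timedelta(days=7)
recent = daily[daily["application_date"] > window_start]
past = daily[daily["application_date"] <= window_start]

features = {
    # basic: statistics over the whole history
    "spend_sum": daily["daily_spend"].sum(),
    "spend_mean": daily["daily_spend"].mean(),
    # window: total over the recent 7 days relative to the cutoff
    "spend_sum_7d": recent["daily_spend"].sum(),
    # ratio: recent vs past spending
    "spend_ratio_7d": recent["daily_spend"].sum() / past["daily_spend"].sum(),
}
```

A ratio well above 1 flags customers whose recent spending outpaces their history, which is the kind of trend signal the ratio family is for.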
Chapter 3: Point-in-Time Joins & Training-Serving Parity (~35 minutes)¶
Build a production ML pipeline with temporal correctness:
```
source.churn_labels (spine with application_date)
        ↓
@transform: pit_training_data
    PIT-joins customer_daily_agg via FeatureFrame.pit_join()
        ↓
SOA: customer_training_features (per-customer temporal features)
        ↓
@transform: churn_model (scikit-learn)
        ↓
REPL: Online serving demo (ctx.features())
```
You'll learn:
- Point-in-time joins with FeatureFrame.pit_join() to prevent data leakage
- Per-customer SOA temporal features (spending trends, recency)
- Training scikit-learn models inside @transform nodes
- Online serving with ctx.features() for training-serving parity
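The core idea behind a point-in-time join can be illustrated with `pandas.merge_asof`: for each label row, take the latest feature snapshot at or before that label's date, so no future information leaks into training. This is a conceptual sketch with made-up data, not Seeknal's `FeatureFrame.pit_join()` implementation:

```python
import pandas as pd

# Label spine: one row per (customer, application_date)
labels = pd.DataFrame({
    "customer_id": [1, 1],
    "application_date": pd.to_datetime(["2024-01-05", "2024-01-20"]),
    "churned": [0, 1],
})

# Feature snapshots computed at different points in time
snapshots = pd.DataFrame({
    "customer_id": [1, 1, 1],
    "application_date": pd.to_datetime(
        ["2024-01-01", "2024-01-10", "2024-01-15"]
    ),
    "total_spend": [10.0, 50.0, 90.0],
})

# For each label row, pick the most recent snapshot <= its date;
# later snapshots (the future, from that label's viewpoint) are ignored.
training = pd.merge_asof(
    labels.sort_values("application_date"),
    snapshots.sort_values("application_date"),
    on="application_date",
    by="customer_id",
)
```

The label dated 2024-01-05 gets the 2024-01-01 snapshot (total_spend 10.0), not the later ones, which is exactly the leakage a naive join on customer_id would introduce.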
Chapter 4: Entity Consolidation (~15 minutes)¶
Consolidate multiple feature groups into unified entity views:
```
feature_group.customer_features ──┐
                                  ├──→ Entity Consolidation ──→ entity_customer
feature_group.product_preferences ┘        (automatic)               ↓
                                                              REPL Exploration
                                                                     ↓
                                                           seeknal entity list/show
```
You'll build:
- A second feature group (product_preferences) for the customer entity
- Automatic consolidation with struct-namespaced columns
- CLI commands to inspect consolidated entities
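Conceptually, consolidation merges several feature groups keyed on the same entity into one wide view, with each group's columns namespaced by their group. Seeknal does this automatically with struct columns; this pandas sketch approximates the idea with dotted column prefixes on assumed feature names:

```python
import pandas as pd

customer_features = pd.DataFrame({
    "customer_id": [1, 2],
    "total_spend": [100.0, 40.0],
})
product_preferences = pd.DataFrame({
    "customer_id": [1, 2],
    "favorite_category": ["books", "toys"],
})

# Namespace each feature group's columns, then join on the entity key.
entity_customer = customer_features.rename(
    columns={"total_spend": "customer_features.total_spend"}
).merge(
    product_preferences.rename(
        columns={"favorite_category": "product_preferences.favorite_category"}
    ),
    on="customer_id",
)
```

The namespacing is what keeps same-named features from different groups from colliding once they live in one entity view.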
Chapter 5: End-to-End ML with MLflow (~20 minutes)¶
Build a complete ML workflow with experiment tracking and batch predictions:
```
transform.training_dataset (Ch4) ──→ transform.train_propensity (scikit-learn + MLflow)
                                                 ↓
                                       mlruns/ (model artifact)
                                                 ↓
second_order_aggregation.customer_training_features ─┐
                                                     ├──→ transform.score_customers
feature_group.product_preferences ───────────────────┘          ↓
                                                      REPL: Propensity ranking
```
You'll build:
- A training pipeline that logs experiments to MLflow (params, metrics, model artifact)
- A prediction pipeline that loads the trained model and scores all customers
- Propensity scores with ranks and segments queryable in the REPL
- Separation of training and inference — the production pattern
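The training/inference split can be sketched with scikit-learn alone. The synthetic features, model choice, and ranking logic here are illustrative assumptions, and the MLflow logging/loading steps are elided to comments so the sketch stays self-contained:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# --- Training pipeline: fit a propensity model on historical features ---
X_train = rng.normal(size=(200, 3))  # e.g. spend, recency, spend ratio
y_train = (X_train[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
model = LogisticRegression().fit(X_train, y_train)
# (in the tutorial, params/metrics and the fitted model are logged to MLflow here)

# --- Prediction pipeline: load the model and score all customers ---
# (in the tutorial, the model is loaded back from the mlruns/ artifact)
X_all = rng.normal(size=(5, 3))
scores = model.predict_proba(X_all)[:, 1]      # propensity per customer
ranks = scores.argsort()[::-1].argsort() + 1   # rank 1 = highest propensity
```

Keeping the two halves in separate pipeline nodes is what lets you re-run scoring on fresh data without retraining.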
What You'll Build¶
By the end of this path, you'll have a complete ML pipeline:
| Component | Decorator / Tool | Purpose |
|---|---|---|
| Sources | @source | Declare data ingestion (CSV, Parquet, DB) |
| Feature Groups | @feature_group | Compute and version ML features |
| Transforms | @transform | Data prep, PIT joins, model training |
| SOA | YAML features: spec | Hierarchical meta-features (basic, window, ratio) |
| PIT Joins | FeatureFrame.pit_join() | Temporally correct training data |
| Online Serving | ctx.features() | Training-serving parity for inference |
| Entity Consolidation | CLI + REPL | Unified per-entity views with struct columns |
| Experiment Tracking | MLflow | Log parameters, metrics, and model artifacts |
| Batch Predictions | @transform + MLflow | Load trained models, score customers |
Key Commands You'll Learn¶
```shell
# Python pipeline templates
seeknal draft source <name> --python --deps pandas
seeknal draft feature-group <name> --python --deps pandas,duckdb
seeknal draft transform <name> --python --deps pandas,scikit-learn
seeknal draft second-order-aggregation <name>

# Preview and apply
seeknal dry-run <draft_file>.py
seeknal apply <draft_file>.py
seeknal apply <draft_file>.yml

# Pipeline execution
seeknal plan
seeknal run

# Feature management
seeknal validate-features <fg_name> --mode fail
seeknal lineage <node> --ascii

# Entity consolidation
seeknal entity list
seeknal entity show <entity_name>
seeknal consolidate

# ML experiment tracking
mlflow ui --backend-store-uri file:./mlruns

# Interactive verification
seeknal repl
```
Resources¶
Reference¶
- Python Pipelines Guide — Full decorator reference and patterns
- Entity Consolidation Guide — Cross-FG retrieval and materialization
- CLI Reference — All commands and flags
- YAML Schema Reference — Feature group and SOA schemas
Related Paths¶
- Data Engineer Path — ELT pipelines (prerequisite)
- Analytics Engineer Path — Semantic layers and metrics
- Advanced Guide: Python Pipelines — Mixed YAML + Python
Last updated: February 2026 | Seeknal Documentation