Chapter 2: Second-Order Aggregations¶

Duration: 30 minutes | Difficulty: Advanced | Format: Python Transform + SOA Engine

Learn to generate hierarchical features from transaction data using a Python @feature_group for first-order aggregation and Seeknal's SOA engine for second-order meta-features — with built-in support for basic, window, and ratio aggregations. Choose between YAML (declarative) or Python (@second_order_aggregation decorator) to define your SOA nodes.

What You'll Build¶

Building on Chapter 1's feature store, you'll add a hierarchical feature pipeline:

(from Chapter 1)
source.transactions ──→ feature_group.customer_features
         ↓
feature_group.customer_daily_agg ──→ second_order_aggregation.region_metrics
         (Python @feature_group)              (YAML or Python SOA)
                                              ├── basic: sum, mean, max, stddev
                                              ├── window: recent 7-day totals
                                              └── ratio: recent vs past spending

After this chapter, you'll have: - A Python feature group computing first-order daily aggregations via ctx.ref() - An SOA node using the built-in aggregation engine (basic, window, ratio) — via YAML or Python - Pipeline execution with REPL verification - Understanding of hierarchical feature engineering

Prerequisites¶

Before starting, ensure you've completed:

Chapter 1: Feature Store — source.transactions and feature_group.customer_features created
Comfortable with aggregation concepts (SUM, COUNT, window functions)

Part 1: Understand Second-Order Aggregations (5 minutes)¶

First-Order vs Second-Order¶

First-order aggregation: Raw data → Features

Transactions → GROUP BY customer_id → total_spent, avg_amount, order_count

Second-order aggregation: Features → Meta-features

Customer features → GROUP BY region → avg_customer_spend, max_order_count
                  → Time windows    → recent 7-day totals, spending trends
                  → Ratios          → recent_vs_past spending direction

When to Use Second-Order Aggregations¶

Use Case	First-Order	Second-Order
Customer total spend	`SUM(amount)` per customer	Average of customer spends per region
Purchase frequency	`COUNT(*)` per customer	Stddev of frequencies across regions
Spending trends	Not possible	Ratio of recent vs past spending per region

Why Second-Order?

Second-order aggregations capture patterns across aggregated entities — e.g., how customers in a region behave collectively, or how spending trends shift over time. These features are powerful for ML models because they capture context and distribution patterns.

Seeknal's SOA Engine¶

Seeknal provides an SOA engine that generates optimized DuckDB SQL from a simple features: spec. You can define SOA nodes in YAML (declarative) or Python (@second_order_aggregation decorator). Both approaches use the same engine — you declare what to compute:

Aggregation Type	What It Does	Example
basic	Simple aggregation over all history	`basic: [sum, mean, max]`
window	Aggregation over a time window (days)	`window: [1, 7]` → last 7 days
ratio	Ratio of two time windows	`numerator: [1, 7]`, `denominator: [8, 14]`
since	Conditional aggregation	`condition: "flag = 1"`

The engine computes _days_between (the difference between the feature date and a reference application date), then uses it to filter time windows automatically.

The `application_date` Column¶

The SOA engine's time-window and ratio features require a reference date to answer "how many days ago did this event happen?" This reference date is called application_date_col.

How it works:

_days_between = date_diff('day', feature_date_col, application_date_col)

For example, if order_date is 2026-01-15 and application_date is 2026-01-21:

_days_between = 6  (the order was 6 days before the reference date)

The engine then uses _days_between to filter time windows:

window: [1, 7] → keeps rows where _days_between is between 1 and 7
ratio: {numerator: [1, 7], denominator: [8, 14]} → compares days 1–7 vs days 8–14

What value should application_date be?

Scenario	Value	Why
Tutorial / testing	Fixed date like `'2026-01-21'`	Reproducible results; pick 1 day after your latest data
Daily batch pipeline	`CURRENT_DATE`	Features always relative to "today"
ML prediction	The date you're predicting for	Features describe history as of that date
Backfill	The batch run date parameter	Use `ctx.params.get('run_date')` for parameterized runs

Critical: application_date must exist in the upstream data

The SOA engine expects application_date_col as an actual column in the upstream DataFrame — it is not generated automatically. Your first-order transform must include it in its output. If it's missing, you'll get:

Binder Error: Referenced column "application_date" not found in FROM clause!

The fix is to add application_date to your transform's SELECT clause (see Step 1 below).

All aggregation types respect _days_between — including basic aggregations (no window or ratio). The engine filters to _days_between >= 0, meaning only rows where feature_date <= application_date are included. This prevents data leakage by ensuring no future data is aggregated.

Part 2: Build the Pipeline (15 minutes)¶

Reuse Chapter 1's Source¶

Chapter 1 already created source.transactions from data/transactions.csv with columns: customer_id, order_id, order_date, revenue, product_category, region. We'll reference it directly — no need to create a new source.

Step 1: Create the First-Order Feature Group (Python)¶

This feature group aggregates raw transactions to daily per-customer metrics — the input for the SOA engine. We use @feature_group (instead of @transform) because daily customer metrics are features — they capture temporal purchase behavior that downstream nodes (SOA, training pipelines) will consume.

The output must include both order_date (as feature_date_col) and application_date (as application_date_col).

seeknal draft feature-group customer_daily_agg --python --deps pandas,duckdb

Edit draft_feature_group_customer_daily_agg.py:

# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "pandas",
#     "pyarrow",
#     "duckdb",
# ]
# ///

"""Feature Group: Daily aggregation per customer for SOA input."""

from seeknal.pipeline import feature_group


@feature_group(
    name="customer_daily_agg",
    description="Daily aggregation per customer with region context",
)
def customer_daily_agg(ctx):
    """Aggregate transactions to daily customer level."""
    txn = ctx.ref("source.transactions")

    return ctx.duckdb.sql("""
        SELECT
            customer_id,
            region,
            CAST(order_date AS DATE) AS order_date,
            CAST('2026-01-21' AS DATE) AS application_date,
            CAST(SUM(revenue) AS DOUBLE) AS daily_amount,
            CAST(COUNT(*) AS BIGINT) AS daily_count,
            CURRENT_TIMESTAMP AS event_time
        FROM txn
        GROUP BY customer_id, region, CAST(order_date AS DATE)
    """).df()

Why @feature_group without entity?

We omit the entity= parameter here because this feature group has a daily granularity (one row per customer per day), not an entity-level granularity (one row per customer). Without entity, it won't participate in entity consolidation (Chapter 4), but it's still a feature group — it computes reusable features that the SOA engine and training pipelines consume.

The application_date column in practice

We use a fixed date (2026-01-21) that is one day after the latest transaction in our sample data. This ensures all transactions have a positive _days_between value and fall within calculable window ranges. In production, replace with CURRENT_DATE or a parameterized run date. See the application_date section above for details.

Both order_date and application_date must appear in the SELECT — the SOA engine reads them from the upstream output to compute _days_between. If either is missing, you'll get a "column not found" error.

seeknal dry-run draft_feature_group_customer_daily_agg.py
seeknal apply draft_feature_group_customer_daily_agg.py

Checkpoint: The file moves to seeknal/feature_groups/customer_daily_agg.py.

Step 2: Create the Second-Order Aggregation (YAML)¶

Now create the meta-features using Seeknal's declarative SOA engine:

seeknal draft second-order-aggregation region_metrics

This creates draft_second_order_aggregation_region_metrics.yml. Edit it:

kind: second_order_aggregation
name: region_metrics
description: "Region-level meta-features from customer daily aggregations"
id_col: region
feature_date_col: order_date
application_date_col: application_date
source: feature_group.customer_daily_agg
features:
  daily_amount:
    basic: [sum, mean, max, stddev]
  daily_count:
    basic: [sum, mean]

Understanding the YAML spec:

Field	Value	Purpose
`id_col`	`region`	Group-by column for second-order aggregation
`feature_date_col`	`order_date`	Date column for time-window calculations
`application_date_col`	`application_date`	Reference date column for `_days_between` (must exist in upstream output)
`source`	`feature_group.customer_daily_agg`	Upstream feature group to aggregate from
`features`	`daily_amount`, `daily_count`	Feature columns and their aggregation types

The basic: [sum, mean, max, stddev] declaration tells the SOA engine to generate four aggregations per feature — no manual SQL or pandas code needed.

Apply the YAML definition:

seeknal apply draft_second_order_aggregation_region_metrics.yml

Checkpoint: The file moves to seeknal/second_order_aggregations/region_metrics.yml.

Alternative: Python SOA Decorator¶

Instead of YAML, you can define the same SOA node in Python using @second_order_aggregation. The decorator uses the same built-in engine — the difference is that your function body loads the data via ctx.ref(), giving you full Python flexibility for joins, filters, or multi-source inputs.

seeknal draft second-order-aggregation region_metrics --python --deps pandas,duckdb

Edit draft_second_order_aggregation_region_metrics.py:

# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "pandas",
#     "pyarrow",
#     "duckdb",
# ]
# ///

"""Second-order aggregation: Region-level meta-features."""

from seeknal.pipeline import second_order_aggregation


@second_order_aggregation(
    name="region_metrics",
    id_col="region",
    feature_date_col="order_date",
    application_date_col="application_date",
    features={
        "daily_amount": {"basic": ["sum", "mean", "max", "stddev"]},
        "daily_count": {"basic": ["sum", "mean"]},
    },
)
def region_metrics(ctx):
    """Load data for SOA engine — the engine handles aggregation."""
    return ctx.ref("feature_group.customer_daily_agg")

seeknal dry-run draft_second_order_aggregation_region_metrics.py
seeknal apply draft_second_order_aggregation_region_metrics.py

YAML vs Python SOA — When to Use Which

Aspect	YAML	Python (`@second_order_aggregation`)
Data loading	`source:` points to one upstream node	`ctx.ref()` — join multiple sources, filter, transform
Aggregation	Built-in engine	Same built-in engine
Best for	Simple single-source SOA	Complex data prep before aggregation
Syntax	Declarative YAML	Python function + decorator

Both produce identical output — the engine generates the same DuckDB SQL. Choose Python when you need custom data preparation (joins, filters, computed columns) before aggregation.

Step 3: Plan and Run¶

Generate the execution plan and run the full pipeline:

seeknal plan
seeknal run

Expected output:

Seeknal Pipeline Execution
============================================================
1/4: transactions [RUNNING]
  SUCCESS in 0.01s
  Rows: 10
2/4: customer_features [RUNNING]
  SUCCESS in 1.2s
  Rows: 6
3/4: customer_daily_agg [RUNNING]
  SUCCESS in 1.2s
  Rows: 10
4/4: region_metrics [RUNNING]
  SUCCESS in 0.8s
  Rows: 4
✓ State saved
✓ Consolidated entity 'customer': 1 FGs, 6 rows in 0.01s

Chapter 1 Nodes Run Too

seeknal run executes the full pipeline. You'll see transactions and customer_features from Chapter 1 run alongside the new nodes. The consolidation output shows customer_features (the only FG with entity="customer") being consolidated — customer_daily_agg has no entity so it's excluded from consolidation.

Part 3: Verify and Iterate (10 minutes)¶

Inspect Generated Features¶

seeknal repl

-- View region-level meta-features
SELECT * FROM second_order_aggregation_region_metrics;

Expected output:

+--------+------------------+-------------------+-------------------+---------------------+------------------+-------------------+
| region | daily_amount_SUM | daily_amount_MEAN | daily_amount_MAX  | daily_amount_STDDEV | daily_count_SUM  | daily_count_MEAN  |
+--------+------------------+-------------------+-------------------+---------------------+------------------+-------------------+
| east   |            75.25 |             75.25 |             75.25 |                NULL |                1 |               1.0 |
| north  |           384.93 |            96.233 |            199.95 |              73.xxx |                4 |               1.0 |
| south  |           749.49 |           187.373 |            250.00 |              68.xxx |                4 |               1.0 |
| west   |            45.99 |             45.99 |             45.99 |                NULL |                1 |               1.0 |
+--------+------------------+-------------------+-------------------+---------------------+------------------+-------------------+

Auto-Generated Column Names

The SOA engine names output columns as {feature}_{AGG} — e.g., daily_amount_SUM, daily_amount_MEAN. This convention keeps features traceable to their source and aggregation type.

Check Feature Columns¶

-- See all generated feature columns
DESCRIBE second_order_aggregation_region_metrics;

You'll see columns like: - daily_amount_SUM, daily_amount_MEAN, daily_amount_MAX, daily_amount_STDDEV — Basic aggregations of spending - daily_count_SUM, daily_count_MEAN — Basic aggregations of transaction counts

Compare Regions¶

SELECT
    region,
    ROUND(daily_amount_SUM, 2) AS total_spend,
    ROUND(daily_amount_MEAN, 2) AS avg_spend,
    ROUND(daily_amount_STDDEV, 2) AS spend_stddev
FROM second_order_aggregation_region_metrics
ORDER BY daily_amount_SUM DESC;

Checkpoint: You should see 4 regions. south has the highest total spend (749.49), while north shows the highest spending variability. Regions with a single transaction (east, west) have NULL stddev.

Add Window and Ratio Features¶

The real power of the SOA engine is its time-window and ratio features. Add them to your SOA definition:

YAMLPython

Edit seeknal/second_order_aggregations/region_metrics.yml:

kind: second_order_aggregation
name: region_metrics
description: "Region-level meta-features from customer daily aggregations"
id_col: region
feature_date_col: order_date
application_date_col: application_date
source: feature_group.customer_daily_agg
features:
  daily_amount:
    basic: [sum, mean, max, stddev]
  daily_count:
    basic: [sum, mean]
  # NEW: Recent 7-day spending (days 1-7 before application_date)
  recent_spending:
    window: [1, 7]
    basic: [sum]
    source_feature: daily_amount
  # NEW: Ratio of recent vs past week spending
  spending_trend:
    ratio:
      numerator: [1, 7]
      denominator: [8, 14]
      aggs: [sum]
    source_feature: daily_amount

Edit seeknal/second_order_aggregations/region_metrics.py:

@second_order_aggregation(
    name="region_metrics",
    id_col="region",
    feature_date_col="order_date",
    application_date_col="application_date",
    features={
        "daily_amount": {"basic": ["sum", "mean", "max", "stddev"]},
        "daily_count": {"basic": ["sum", "mean"]},
        # NEW: Recent 7-day spending
        "recent_spending": {
            "window": [1, 7],
            "basic": ["sum"],
            "source_feature": "daily_amount",
        },
        # NEW: Ratio of recent vs past week spending
        "spending_trend": {
            "ratio": {"numerator": [1, 7], "denominator": [8, 14], "aggs": ["sum"]},
            "source_feature": "daily_amount",
        },
    },
)
def region_metrics(ctx):
    return ctx.ref("feature_group.customer_daily_agg")

What the new features do:

Feature	Type	Meaning
`recent_spending`	window [1, 7]	Total spending in the last 7 days (Jan 14–20)
`spending_trend`	ratio [1,7]/[8,14]	Recent week spend / past week spend — values >1 mean increasing

Re-run:

seeknal run

Verify the new time-window features:

seeknal repl

SELECT
    region,
    ROUND("daily_amount_SUM_1_7", 2) AS recent_7d,
    ROUND("daily_amount_SUM1_7_SUM8_14", 2) AS trend_ratio
FROM second_order_aggregation_region_metrics
ORDER BY "daily_amount_SUM1_7_SUM8_14" DESC NULLS LAST;

Expected output:

+--------+-----------+-------------+
| region | recent_7d | trend_ratio |
+--------+-----------+-------------+
| north  |    334.94 |        6.70 |
| south  |    299.99 |        0.67 |
| west   |     45.99 |        NULL |
| east   |      NULL |        NULL |
+--------+-----------+-------------+

Interpreting results: - north (trend 6.70): Spending surged — 6.7x more in the recent week vs the prior week - south (trend 0.67): Spending declined — recent week is only 67% of the prior week - east/west: Only 1 transaction each, so window features are NULL (no data in the window range)

Congratulations!

You've built a hierarchical feature pipeline combining a Python transform with Seeknal's declarative SOA engine. The YAML features: spec generates optimized SQL automatically — no manual pandas or DuckDB code needed for the aggregation logic.

What Could Go Wrong?¶

Common Pitfalls

1. Missing application_date_col or feature_date_col in upstream data

Symptom: Binder Error: Referenced column "application_date" not found in FROM clause!
Cause: The upstream transform's SELECT clause doesn't include application_date (or order_date).
Fix: Add both columns to your feature group's SQL output. See the application_date section in Part 1 for what value to use. If you've fixed the SQL but still get the error, run seeknal run --full to rebuild the cached output.

2. Wrong source reference format

Symptom: Invalid source reference: 'customer_daily_agg'. Expected format: 'kind.name'
Fix: Source must be kind.name format, e.g., source: feature_group.customer_daily_agg

3. Window features all NULL

Symptom: Window and ratio columns are NULL for all regions
Fix: Check that _days_between values fall within your window range. If application_date is too far from order_date, all transactions fall outside the window. Adjust application_date in your transform.

4. Feature column not found

Symptom: Missing feature columns: {'daily_amount'}
Fix: Ensure the features: key names (or source_feature: values) match actual column names in the upstream feature group output.

5. Empty output (0 rows)

Symptom: 0 rows in result
Fix: Check that id_col matches an actual column in the source feature group output.

Summary¶

In this chapter, you learned:

Feature Groups for Aggregation — Compute first-order daily aggregations with @feature_group and ctx.ref()
SOA Engine — Declare meta-features with features: spec (basic, window, ratio)
YAML vs Python SOA — YAML for simple single-source SOA; Python @second_order_aggregation for custom data prep
Application Date — Reference date column for _days_between computation; must be present in upstream data
Auto-Generated Columns — {feature}_{AGG} naming convention from the SOA engine
Pipeline Execution — Apply, plan, run, and verify with REPL

SOA Features Reference:

Type	YAML Syntax	Output Column
Basic	`basic: [sum, mean]`	`{feature}_SUM`, `{feature}_MEAN`
Window	`window: [1, 7]` + `basic: [sum]`	`{feature}_SUM_1_7`
Ratio	`ratio: {numerator: [1,7], denominator: [8,14], aggs: [sum]}`	`{feature}_SUM1_7_SUM8_14`
Since	`since: {condition: "flag = 1", aggs: [count]}`	`SINCE_COUNT_{feature}_GEO_D`

Key Commands:

seeknal draft feature-group <name> --python             # Python feature group template
seeknal draft second-order-aggregation <name>           # YAML SOA template
seeknal draft second-order-aggregation <name> --python  # Python SOA template
seeknal dry-run <file>.py                               # Preview Python node
seeknal apply <file>.py                                 # Save Python node
seeknal apply <file>.yml                                # Save YAML node
seeknal plan                                            # Generate DAG manifest
seeknal run                                             # Execute pipeline
seeknal repl                                            # Interactive verification

What's Next?¶

Chapter 3: Point-in-Time Joins & Training-Serving Parity →

Build a production ML pipeline with point-in-time correct training data, temporal SOA features, and online serving for inference.

Chapter 2: Second-Order Aggregations¶

What You'll Build¶

Prerequisites¶

Part 1: Understand Second-Order Aggregations (5 minutes)¶

First-Order vs Second-Order¶

When to Use Second-Order Aggregations¶

Seeknal's SOA Engine¶

The application_date Column¶

Part 2: Build the Pipeline (15 minutes)¶

Reuse Chapter 1's Source¶

Step 1: Create the First-Order Feature Group (Python)¶

Step 2: Create the Second-Order Aggregation (YAML)¶

Alternative: Python SOA Decorator¶

Step 3: Plan and Run¶

Part 3: Verify and Iterate (10 minutes)¶

Inspect Generated Features¶

Check Feature Columns¶

Compare Regions¶

Add Window and Ratio Features¶

What Could Go Wrong?¶

Summary¶

What's Next?¶

See Also¶

The `application_date` Column¶