Skip to content

Change Categorization

Seeknal automatically classifies pipeline changes into three categories to determine rebuild scope and prevent breaking downstream consumers.

What Is Change Categorization?

Change categorization is an automatic system that: - Analyzes differences between manifests - Classifies changes by their impact on downstream nodes - Determines which nodes need rebuilding - Prevents breaking changes from silently propagating

When you modify a YAML file, Seeknal computes a diff and assigns each change to one of three categories based on its impact.

Three Categories

BREAKING

Definition: Changes that require rebuilding this node and all downstream dependents.

Why: Downstream nodes may reference fields, columns, or inputs that no longer exist.

Examples: - Column removed or renamed - Column data type changed - Input dependency removed - Node renamed (changes node ID) - Node type changed (source → transform) - Entity key changed (for feature groups)

Downstream impact: All transitive downstream nodes rebuild (via BFS traversal).

CLI indicator: [BREAKING - rebuilds downstream]

NON_BREAKING

Definition: Changes that require rebuilding only this node.

Why: The change affects this node's logic but doesn't invalidate downstream dependencies.

Examples: - Column added (downstream queries still valid) - SQL logic changed (same output schema) - Input dependency added - Feature definition changed - Transform SQL modified (preserves columns) - Aggregation window changed

Downstream impact: Downstream nodes also rebuild (data dependency), but not flagged as breaking.

CLI indicator: [CHANGED - rebuild this node]

METADATA

Definition: Changes that don't require any execution.

Why: Only documentation or tags changed - functional logic unchanged.

Examples: - Description updated - Owner changed - Tags modified - Audit rules added (in config)

Downstream impact: None. Manifest updates only.

CLI indicator: [METADATA - no rebuild needed]

Rules Table

Change Type Category Downstream Impact Example
Column removed BREAKING All downstream total_amount deleted
Column renamed BREAKING All downstream total_amountamount
Column type changed BREAKING All downstream INTEGERSTRING
Column added NON_BREAKING Rebuild only New discount column
SQL transform changed NON_BREAKING Rebuild only New WHERE clause
Input removed BREAKING All downstream Removed source.orders
Input added NON_BREAKING Rebuild only Added source.customers
Feature removed BREAKING All downstream Deleted ML feature
Feature added NON_BREAKING Rebuild only New aggregation
Node renamed BREAKING All downstream orderssales
Node type changed BREAKING All downstream Transform → Model
Description changed METADATA None Docs update
Owner changed METADATA None Team reassignment
Tags changed METADATA None Added pii tag

See Also

How It Works

1. Manifest Diffing

When you run seeknal diff or seeknal plan, Seeknal:

  1. Loads the old manifest (from production or environment)
  2. Parses the new manifest (from current YAML files)
  3. Compares node-by-node to detect changes

2. Change Classification

For each modified node, Seeknal inspects:

# Check columns
if old_columns - new_columns:
    return ChangeCategory.BREAKING  # Column removed

# Check column types
for col in old_columns & new_columns:
    if old_type != new_type:
        return ChangeCategory.BREAKING  # Type changed

# Check inputs
if old_inputs - new_inputs:
    return ChangeCategory.BREAKING  # Input removed

# Check features
if old_features - new_features:
    return ChangeCategory.BREAKING  # Feature removed

# Check metadata fields
if only description/owner/tags changed:
    return ChangeCategory.METADATA

# Default: NON_BREAKING
return ChangeCategory.NON_BREAKING

3. Downstream Propagation

For BREAKING changes, Seeknal uses BFS to find all downstream dependents:

source.orders [BREAKING - column removed]
transform.clean_orders [downstream rebuild]
feature_group.order_features [downstream rebuild]
model.churn_prediction [downstream rebuild]

All four nodes rebuild, but only source.orders is flagged as BREAKING.

4. Execution Plan

The executor: - Skips METADATA-only nodes (manifest update only) - Rebuilds NON_BREAKING nodes - Rebuilds BREAKING nodes + all downstream

Downstream Impact Calculation

BFS Traversal

Seeknal uses breadth-first search to find all transitive downstream nodes:

def _bfs_downstream(node_id: str, manifest: Manifest) -> set[str]:
    downstream = set()
    queue = deque([node_id])
    while queue:
        current = queue.popleft()
        for child in manifest.get_downstream_nodes(current):
            if child not in downstream:
                downstream.add(child)
                queue.append(child)
    return downstream

This ensures all transitive dependencies are rebuilt, preventing stale data.

Propagation Rules

  • BREAKING: This node + all downstream (recursive)
  • NON_BREAKING: This node + downstream (recursive, but not marked breaking)
  • METADATA: No execution, manifest only

Example DAG:

A [BREAKING]
├─ B [downstream]
│  └─ D [downstream]
└─ C [downstream]
   └─ D [downstream]

If A is BREAKING, B, C, and D all rebuild (D detected via both paths).

CLI Command Reference

seeknal diff

Show categorized changes without execution:

seeknal diff

# Output:
# [BREAKING - rebuilds downstream] transform.customer_metrics
#     columns: removed: total_mrr; added: total_monthly_recurring_revenue
#     Downstream impact (1 nodes):
#       -> REBUILD aggregation.churn_rate_analysis
#
# Summary:
#   1 breaking change(s) — downstream nodes will rebuild

seeknal plan <env>

Create environment plan with change categories:

seeknal plan dev

# Output:
# Environment Plan: dev
# ============================================================
#
# Change Detection
# ============================================================
# Modified Nodes:
#   1. clean_orders
#      Category: NON_BREAKING
#      Changed fields: config
#      Details:
#        - transform: Added 'discount_pct' column
#
# Downstream Impact:
#   2. customer_enriched
#      Reason: Depends on clean_orders
#      Category: NON_BREAKING
#
# Execution Plan:
# ============================================================
#   Total nodes to rebuild: 2
#   Breaking changes: 0
#   Non-breaking changes: 2
#   Metadata-only changes: 0

seeknal run --env <name>

Execute plan with categorized rebuilds:

seeknal run --env dev

# Output:
# Applying Environment: dev
# ============================================================
# ℹ Nodes to execute: 2
#
# Execution
# ============================================================
# 1/2: clean_orders [RUNNING]
#   SUCCESS in 0.03s
#
# 2/2: customer_enriched [RUNNING]
#   SUCCESS in 0.04s

CI/CD Integration Patterns

Pull Request Checks

# .github/workflows/pr-check.yml
name: Pipeline PR Check

on: pull_request

jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install Seeknal
        run: pip install seeknal

      - name: Detect changes
        id: diff
        run: |
          seeknal diff > diff_output.txt
          cat diff_output.txt

      - name: Check for breaking changes
        run: |
          if grep -q "BREAKING" diff_output.txt; then
            echo "⚠️ BREAKING CHANGES DETECTED"
            echo "Review required before merge"
            exit 1
          else
            echo "✓ No breaking changes"
          fi

Environment-Based Deployment

# .github/workflows/deploy.yml
name: Deploy Pipeline

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Plan staging deployment
        run: seeknal plan staging

      - name: Apply to staging
        run: seeknal run --env staging --parallel

      - name: Run tests
        run: pytest tests/

      - name: Check for breaking changes
        id: breaking
        run: |
          if seeknal diff | grep -q "BREAKING"; then
            echo "has_breaking=true" >> $GITHUB_OUTPUT
          else
            echo "has_breaking=false" >> $GITHUB_OUTPUT
          fi

      - name: Require manual approval for breaking changes
        if: steps.breaking.outputs.has_breaking == 'true'
        uses: trstringer/manual-approval@v1
        with:
          approvers: team-leads
          issue-title: "Breaking change approval required"

      - name: Promote to production
        run: seeknal promote staging

Automated Notifications

- name: Notify breaking changes
  if: steps.breaking.outputs.has_breaking == 'true'
  uses: slackapi/slack-github-action@v1
  with:
    payload: |
      {
        "text": "⚠️ Breaking changes detected in pipeline",
        "blocks": [
          {
            "type": "section",
            "text": {
              "type": "mrkdwn",
              "text": "*Breaking Changes Detected*\nPR: ${{ github.event.pull_request.html_url }}"
            }
          }
        ]
      }

Advanced Features

Custom Classification Rules

Future extension point for domain-specific rules:

# Future API (not yet implemented)
from seeknal.dag.diff import ChangeClassifier

@ChangeClassifier.register("custom_rule")
def classify_custom(old_node, new_node):
    if old_node.config.get("pii") and not new_node.config.get("pii"):
        return ChangeCategory.BREAKING  # PII flag removed
    return ChangeCategory.METADATA

Change Details

The diff module provides human-readable details for each change:

change.changed_details = {
    "columns": "added: discount_pct, removed: old_price",
    "config": "keys changed: features, entity"
}

These details appear in CLI output and can be used for notifications.

Severity Ordering

Categories have implicit severity for max() comparison:

METADATA < NON_BREAKING < BREAKING

If a node has multiple changes (e.g., column added + column removed), the most severe category wins:

col_added = NON_BREAKING
col_removed = BREAKING
result = max(col_added, col_removed, key=_severity)  # BREAKING

Comparison to Other Tools

vs dbt state:modified

Feature Seeknal Change Categorization dbt state:modified
Granularity 3 categories (BREAKING/NON_BREAKING/METADATA) Binary (modified/not)
Downstream impact Automatic BFS propagation Manual + selector
Column changes Detected and categorized Not detected
Type changes BREAKING Not detected
Metadata-only Skips execution Still executes
CI/CD integration Built-in categories Manual scripting

vs SQLMesh change categories

Feature Seeknal SQLMesh
Categories BREAKING, NON_BREAKING, METADATA Breaking, Non-breaking, Forward-only
Detection Hash-based + schema inspection Plan-based
Propagation BFS downstream Plan rewrite
Execution Categorical rebuild Virtual data
Virtual layers No (full rebuild) Yes (CTE views)

Best Practices

Understanding BREAKING vs NON_BREAKING

BREAKING means: - Downstream queries will fail (column referenced but removed) - Schema contract violated - Manual intervention likely needed

NON_BREAKING means: - Downstream queries still valid (new columns ignored by SELECT *) - Logic changed but interface stable - Safe to auto-promote (with tests)

When to Accept Breaking Changes

Breaking changes are sometimes necessary: - Renaming columns for clarity - Changing types for correctness - Removing deprecated features

Workflow: 1. seeknal plan dev to see downstream impact 2. Update downstream nodes to handle change 3. Test in dev environment 4. Promote together in one deployment

When to Avoid Breaking Changes

Use non-breaking alternatives when possible: - Add new column instead of renaming (deprecate old) - Add type cast logic instead of changing type - Add new input instead of replacing

Example:

# BREAKING: Column renamed
SELECT customer_id AS cust_id  # Downstream breaks

# NON_BREAKING: Add new column, keep old
SELECT
  customer_id,
  customer_id AS cust_id  # Downstream unaffected

Testing Change Categories

Write tests to verify expected categories:

# tests/test_change_categorization.py
def test_column_removed_is_breaking():
    old_yaml = {"columns": {"a": "INT", "b": "STRING"}}
    new_yaml = {"columns": {"a": "INT"}}

    category = classify_change(old_yaml, new_yaml)
    assert category == ChangeCategory.BREAKING

def test_column_added_is_non_breaking():
    old_yaml = {"columns": {"a": "INT"}}
    new_yaml = {"columns": {"a": "INT", "b": "STRING"}}

    category = classify_change(old_yaml, new_yaml)
    assert category == ChangeCategory.NON_BREAKING

Troubleshooting

False positive: BREAKING but safe

Problem: Change categorized as BREAKING but downstream actually safe.

Solution: This is conservative by design. Review downstream nodes manually. If truly safe, the category is correct (downstream should rebuild to pick up new data).

False negative: NON_BREAKING but breaks downstream

Problem: Change categorized as NON_BREAKING but downstream queries fail.

Solution: This indicates an incomplete change detection rule. Report as a bug with example YAML files.

Metadata changes triggering rebuilds

Problem: Only changed description, but node still rebuilds.

Solution: Check if functional fields also changed (e.g., transform SQL whitespace differences are ignored, but logic changes are not).

Downstream rebuild list is incomplete

Problem: Expected node to rebuild but not in downstream list.

Solution: Check DAG edges. The node may not be a transitive dependent. Use seeknal plan to see full dependency graph.

Implementation Details

Source Files

  • src/seeknal/dag/diff.py - ManifestDiff and ChangeCategory
  • src/seeknal/workflow/environment.py - Plan with categorization
  • src/seeknal/cli/main.py - CLI diff command

Key Classes

class ChangeCategory(Enum):
    BREAKING = "breaking"
    NON_BREAKING = "non_breaking"
    METADATA = "metadata"

@dataclass
class NodeChange:
    node_id: str
    change_type: DiffType
    category: ChangeCategory
    changed_fields: list[str]
    changed_details: dict[str, str]

Algorithms

Column change classification: 1. Check if columns removed → BREAKING 2. Check if column types changed → BREAKING 3. Check if columns only added → NON_BREAKING

Config change classification: 1. Check if entity/source_type changed → BREAKING 2. Check if features removed → BREAKING 3. Check if inputs removed → BREAKING 4. Check if only audits/metadata changed → METADATA 5. Default → NON_BREAKING

Downstream BFS: 1. Start from changed node 2. Add all children to queue 3. For each child, recurse to their children 4. Mark all visited nodes for rebuild