Skip to content

Chapter 3: Data Rules

Duration: 25 minutes | Difficulty: Beginner | Format: YAML & CLI

Learn to validate data quality with automated rule checks in your Seeknal pipeline.


What You'll Build

Three rule nodes that validate your pipeline data:

transform.events_cleaned ──→ rule.not_null_quantity    (null check)
                         ──→ rule.positive_quantity    (range check)
source.products          ──→ rule.valid_prices         (range on source)

After this chapter, you'll have: - Rules that check for null values, valid ranges, and data integrity - Understanding of severity: error vs severity: warn - A complete pipeline with both transforms and validation rules


Prerequisites

Before starting, ensure you've completed:


Part 1: Null & Range Rules (8 minutes)

Understanding Rules

Rules are separate YAML nodes that validate data from upstream sources or transforms. They use the same inputs: and ref: syntax as transforms, but instead of producing new tables, they produce pass/fail results.

Key concepts: - rule.type: — The kind of validation: null, range, uniqueness, or freshness - rule.columns: or rule.column: — Which column(s) to validate - rule.params: — Type-specific thresholds (e.g., min_val, max_null_percentage) - severity: error — Fails the pipeline run if the rule is violated - severity: warn — Logs a warning but continues the pipeline run

Check for Null Quantities

Even though events_cleaned filters nulls, let's add a rule as a safety net:

seeknal draft rule not_null_quantity

Edit draft_rule_not_null_quantity.yml:

kind: rule
name: not_null_quantity
description: "Ensure quantity is never null in cleaned events"
inputs:
  - ref: transform.events_cleaned
rule:
  type: "null"
  columns:
    - quantity
  params:
    max_null_percentage: 0.0
params:
  severity: error
  error_message: "Null quantity found in cleaned events"

Rule Type: null

The null type checks for null values across the listed columns. max_null_percentage: 0.0 means zero nulls are allowed. Setting it to 0.05 would allow up to 5% null rows.

Rules Need inputs:

Just like transforms, rules must declare their upstream dependency in inputs:. The rule evaluates against the data from that node.

Check for Positive Quantities

seeknal draft rule positive_quantity

Edit draft_rule_positive_quantity.yml:

kind: rule
name: positive_quantity
description: "Ensure all quantities are positive"
inputs:
  - ref: transform.events_cleaned
rule:
  type: range
  column: quantity
  params:
    min_val: 1
params:
  severity: error
  error_message: "Non-positive quantity detected"

Rule Type: range

The range type checks that values in column: fall within min_val and/or max_val. Here, min_val: 1 ensures all quantities are at least 1. You can use min_val, max_val, or both.

Validate and Apply Both Rules

seeknal dry-run draft_rule_not_null_quantity.yml
seeknal apply draft_rule_not_null_quantity.yml

seeknal dry-run draft_rule_positive_quantity.yml
seeknal apply draft_rule_positive_quantity.yml

Checkpoint: Both rules should apply successfully. They're now part of your project at seeknal/rules/.


Part 2: Source Validation & Severity Levels (7 minutes)

Validate Product Prices

Rules can check source data too, not just transforms. Let's validate that product prices are reasonable:

seeknal draft rule valid_prices

Edit draft_rule_valid_prices.yml:

kind: rule
name: valid_prices
description: "Ensure product prices are within valid range"
inputs:
  - ref: source.products
rule:
  type: range
  column: price
  params:
    min_val: 0
    max_val: 10000
params:
  severity: warn
  error_message: "Product price outside valid range (0-10000)"

Error vs Warn

This rule uses severity: warn instead of error. Here's the difference:

Severity Behavior Use When
error Fails the pipeline run Data must be valid (e.g., no nulls in required fields)
warn Logs warning, continues Data should be valid but pipeline can proceed (e.g., price sanity checks)

Use error for hard constraints and warn for soft guidelines.

seeknal dry-run draft_rule_valid_prices.yml
seeknal apply draft_rule_valid_prices.yml

Multiple Checks in One Rule

You can combine multiple validation checks in a single rule file by using a list under rule:. This keeps related checks together:

kind: rule
name: product_quality
description: "Validate product data quality"
inputs:
  - ref: source.products
rule:
  - type: "null"
    columns: [name, price]
    params:
      max_null_percentage: 0.0
  - type: range
    column: price
    params:
      min_val: 0
      max_val: 10000
params:
  severity: warn
  error_message: "Product data quality check failed"

All checks run against the same input data. If any check fails, the rule fails (respecting the severity setting). The results metadata includes details for each individual check.

Single vs Multi-Check

Both formats work — use whichever fits your needs:

Format When to Use
rule: {type: ...} Single focused validation
rule: [{type: ...}, ...] Multiple related validations on the same input

Part 2b: SQL Assertion Rules (5 minutes)

Custom SQL Validation

For checks that don't fit the built-in rule types (null, range, uniqueness, freshness), use sql_assertion. This is a dbt-style test — you write a SQL query that returns "bad rows". If the query returns any rows, the rule fails.

seeknal draft rule no_duplicate_events

Edit draft_rule_no_duplicate_events.yml:

kind: rule
name: no_duplicate_events
description: "Ensure no duplicate event_id values exist"
inputs:
  - ref: transform.events_cleaned
rule:
  type: sql_assertion
  sql: |
    SELECT event_id, COUNT(*) as cnt
    FROM input_data
    GROUP BY event_id
    HAVING cnt > 1
params:
  severity: error
  error_message: "Duplicate event_id found in cleaned events"

Rule Type: sql_assertion

The sql_assertion type runs your SQL query against the input data. If the query returns zero rows, the rule passes. If it returns any rows, those are violations and the rule fails.

The first input is automatically available as input_data. You can write any SQL — aggregations, subqueries, window functions — as long as it returns the "bad rows" you want to detect.

Cross-Table Assertions

sql_assertion supports multiple inputs. Each input ref becomes a named view (dots replaced with underscores):

kind: rule
name: valid_order_references
description: "Ensure all orders reference valid products"
inputs:
  - ref: source.sales_events
  - ref: source.products
rule:
  type: sql_assertion
  sql: |
    SELECT e.event_id, e.product_id
    FROM source_sales_events e
    LEFT JOIN source_products p ON e.product_id = p.product_id
    WHERE p.product_id IS NULL
params:
  severity: warn
  error_message: "Orders reference non-existent products"

Named Views from Inputs

Each ref: in inputs: creates a DuckDB view named by replacing dots with underscores:

Input ref View name
source.sales_events source_sales_events
source.products source_products
transform.events_cleaned transform_events_cleaned

The first input is also available as input_data for backward compatibility.

Aggregate Checks

Use sql_assertion for aggregate validations that the built-in types don't cover:

kind: rule
name: reasonable_avg_price
description: "Ensure average product price is reasonable"
inputs:
  - ref: source.products
rule:
  type: sql_assertion
  sql: |
    SELECT 1
    WHERE (SELECT AVG(price) FROM input_data) > 5000
params:
  severity: warn

Apply and test:

seeknal dry-run draft_rule_no_duplicate_events.yml
seeknal apply draft_rule_no_duplicate_events.yml

Checkpoint: The sql_assertion rule should apply successfully and pass when running seeknal run.


Part 3: Running Rules in Your Pipeline (5 minutes)

View the Full DAG

Now let's see the complete pipeline with all transforms and rules:

seeknal plan

You should see a DAG with 9 nodes:

source.products
source.sales_events
source.sales_snapshot
transform.events_cleaned        (depends on: source.sales_events)
transform.sales_enriched        (depends on: transform.events_cleaned, source.products)
transform.sales_summary         (depends on: transform.sales_enriched)
rule.not_null_quantity           (depends on: transform.events_cleaned)
rule.positive_quantity           (depends on: transform.events_cleaned)
rule.valid_prices                (depends on: source.products)

Rules appear as leaf nodes — they depend on their input but nothing depends on them.

Run the Full Pipeline

seeknal run

Watch the output — you'll see each node execute in dependency order. The rules run after their inputs and report validation results:

Expected output:

...
rule.valid_prices: SUCCESS in 0.02s (6 rows)
rule.not_null_quantity: SUCCESS in 0.01s (5 rows)
rule.positive_quantity: SUCCESS in 0.01s (5 rows)
...

All rules pass because: - events_cleaned already filters null and deduplicates, so not_null_quantity and positive_quantity pass - All product prices are between 0 and 10,000, so valid_prices passes

Test with Failing Data

To see what happens when a rule fails, temporarily add a bad row to your products:

echo 'PRD-007,Broken Item,Test,-5.00,2026-01-01' >> data/products.csv
seeknal run

The valid_prices rule will now log a warning (since it uses severity: warn). The pipeline continues running.

If not_null_quantity or positive_quantity fail (they use severity: error), the pipeline would stop at that point.

After testing, remove the bad row:

# Restore original products.csv (keep only first 7 lines: header + 6 data rows)
head -7 data/products.csv > data/products_clean.csv
mv data/products_clean.csv data/products.csv

Checkpoint: Your full pipeline runs successfully with all 9 nodes — 3 sources, 3 transforms, and 3 rules.


What Could Go Wrong?

Common Pitfalls

1. Rule references a column that doesn't exist

  • Symptom: Rule fails with "column not found" error
  • Fix: Check that the column in rule.column: or rule.columns: exists in the output of the node referenced in inputs:. Use seeknal repl to query the upstream node and verify column names.

2. Missing inputs: in rule YAML

  • Symptom: Rule doesn't know what data to validate
  • Fix: Every rule needs inputs: to declare which node it validates. Add inputs: [ref: source.my_source] or inputs: [ref: transform.my_transform].

3. severity: error stops the entire pipeline

  • Symptom: Downstream nodes don't execute
  • Fix: This is by design. Use severity: warn if you want the pipeline to continue despite violations. Reserve error for hard constraints.

4. Quoting the null type

  • Symptom: YAML parse error on type: null
  • Fix: In YAML, unquoted null is interpreted as a null value. Always quote it: type: "null".

Summary

In this chapter, you learned:

  • Rule Nodes — Validate data quality with YAML rule definitions
  • Rule Typesnull checks for missing values, range checks numeric bounds
  • SQL Assertions — Custom SQL queries for arbitrary validation logic
  • Severity Levelserror stops the pipeline, warn continues with a warning
  • Source Validation — Rules can check source data, not just transforms
  • Pipeline Integration — Rules execute as part of seeknal run in DAG order

Supported Rule Types:

Type Purpose Key Params
null Check for null/missing values columns:, max_null_percentage:
range Check numeric bounds column:, min_val:, max_val:
uniqueness Check for duplicates columns:, max_duplicate_percentage:
freshness Check timestamp recency column:, max_age: (e.g., "24h", "7d")
sql_assertion Custom SQL (returns bad rows) sql: (query), multiple inputs: supported

Rule YAML Structure:

# Single check
kind: rule
name: my_rule
description: "What this rule checks"
inputs:
  - ref: source.my_source     # or transform.my_transform
rule:
  type: range                  # null, range, uniqueness, freshness
  column: my_column            # single column (range, freshness)
  # columns: [col1, col2]     # multiple columns (null, uniqueness)
  params:
    min_val: 0                 # type-specific thresholds
    max_val: 100
params:
  severity: error              # error or warn
  error_message: "Description of violation"
# Multiple checks in one rule
kind: rule
name: my_combined_rule
inputs:
  - ref: source.my_source
rule:
  - type: "null"
    columns: [col1, col2]
    params:
      max_null_percentage: 0.0
  - type: range
    column: col1
    params:
      min_val: 0
      max_val: 100
params:
  severity: error
# SQL assertion (dbt-style: returns bad rows)
kind: rule
name: my_sql_check
inputs:
  - ref: source.my_source
  - ref: source.my_other_source    # optional: cross-table checks
rule:
  type: sql_assertion
  sql: |
    SELECT id FROM input_data
    WHERE some_condition
params:
  severity: error

Key Commands:

seeknal draft rule <name>            # Generate rule template
seeknal dry-run <draft_file>.yml     # Preview validation
seeknal apply <draft_file>.yml       # Apply to project
seeknal plan                         # View DAG (rules appear as leaf nodes)
seeknal run                          # Execute pipeline with rule checks


What's Next?

In Chapter 4: Lineage & Inspection, you'll visualize your pipeline's data flow and learn to inspect intermediate outputs for debugging.


See Also