Testing & Audits¶
Seeknal provides built-in data quality validation at two levels: audits for production-time quality checks on pipeline outputs, and feature validation for development-time quality checks on feature groups.
Audits (seeknal audit)¶
What Are Audits?¶
Audits run data quality checks against cached pipeline outputs (Parquet files in target/cache/) without re-executing the pipeline. They verify that your data meets quality expectations after transformation, enabling fast quality validation in CI/CD pipelines.
Key characteristics:
- Execute against cached Parquet files from the last pipeline run
- No re-execution needed: checks read the cached data directly
- Defined inline in YAML node configurations
- Support severity levels (error vs. warn) for flexible failure handling
- Run via seeknal audit command
When to use audits:
- Production data quality checks after pipeline execution
- CI/CD quality gates before deploying transformations
- Post-processing validation of transform outputs
- Quick validation without re-running expensive transformations
Defining Audits in YAML¶
Add audit rules to any node's YAML configuration under the audits: section. Audits are defined as a list of rules, each with a type, required parameters, and optional severity.
Basic structure:
kind: transform
name: clean_users
transform: |
  SELECT * FROM {{ ref('raw_users') }}
  WHERE email IS NOT NULL
inputs:
  - ref: source.raw_users
audits:
  - type: <audit_type>
    # ... type-specific parameters
    severity: error  # or 'warn' (default: error)
Built-in Audit Types¶
1. not_null - Check for NULL values¶
Verifies that specified columns contain no NULL values.
Parameters:
- columns (required): List of column names to check
- severity (optional): error or warn (default: error)
Example output:
PASS not_null [user_id, email]: All values non-null (45ms)
2. unique - Check for duplicates¶
Verifies that specified column combinations are unique (no duplicate rows).
Parameters:
- columns (required): List of column names that should be unique together
- severity (optional): error or warn (default: error)
Example with composite key:
audits:
  - type: unique
    columns: [user_id, event_date]
    severity: error
Example output:
PASS unique [user_id]: All values unique (72ms)
FAIL unique [user_id, event_date] (error): 5 duplicate groups (98ms)
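The "duplicate groups" figure in the output can be read as the number of distinct key combinations that occur more than once. The sketch below illustrates that counting rule in plain Python. It is not Seeknal's implementation; `duplicate_groups` and the sample rows are hypothetical names made up for this illustration.

```python
from collections import Counter


def duplicate_groups(rows, columns):
    """Return {key_tuple: occurrences} for key combinations seen more than once.

    Illustrative helper (not part of Seeknal): `rows` is a list of dicts and
    `columns` is the composite key to test for uniqueness.
    """
    counts = Counter(tuple(row[col] for col in columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical sample data: user 1 has two events on the same date.
events = [
    {"user_id": 1, "event_date": "2024-01-01"},
    {"user_id": 1, "event_date": "2024-01-01"},
    {"user_id": 1, "event_date": "2024-01-02"},
    {"user_id": 2, "event_date": "2024-01-01"},
]

by_user = duplicate_groups(events, ["user_id"])
by_user_and_date = duplicate_groups(events, ["user_id", "event_date"])
print(len(by_user), len(by_user_and_date))
```

A composite key is stricter about what counts as the same row: the third event for user 1 is a duplicate on `user_id` alone, but not on `(user_id, event_date)`.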
3. accepted_values - Validate against allowed values¶
Checks that column values are within an allowed set (enum validation).
audits:
  - type: accepted_values
    columns: [status]
    values: [active, inactive, pending]
    severity: error
Parameters:
- columns (required): List of columns to check (typically one column)
- values (required): List of allowed values
- severity (optional): error or warn (default: error)
Example with warn severity:
audits:
  - type: accepted_values
    columns: [order_status]
    values: [pending, confirmed, shipped, delivered, cancelled]
    severity: warn
Example output:
PASS accepted_values [status]: All values accepted (34ms)
FAIL accepted_values [order_status] (warn): 12 rows with invalid values (56ms)
4. row_count - Validate row count bounds¶
Checks that the output has a minimum and/or maximum number of rows.
Parameters:
- min (optional): Minimum allowed row count
- max (optional): Maximum allowed row count
- severity (optional): error or warn (default: error)
- Note: At least one of min or max must be specified
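The bounds rule can be sketched in a few lines of Python. This is an illustration of the documented behavior, not Seeknal's code: `check_row_count` is a hypothetical helper, the pass and below-minimum messages mirror the example output in this section, and the above-maximum message format is an assumption.

```python
def check_row_count(count, min_rows=None, max_rows=None):
    """Return (passed, message) for a row_count-style bounds check.

    Hypothetical helper for illustration; both bounds are optional, but at
    least one must be given, matching the note in the parameter list.
    """
    if min_rows is None and max_rows is None:
        raise ValueError("at least one of min or max must be specified")
    if min_rows is not None and count < min_rows:
        return False, f"count {count} < min {min_rows}"
    if max_rows is not None and count > max_rows:
        # Assumed message format; the guide only shows the min-violation case.
        return False, f"count {count} > max {max_rows}"
    return True, f"Row count {count} within bounds"


print(check_row_count(5432, min_rows=100))
print(check_row_count(50, min_rows=100))
```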
Example - minimum only:
audits:
  - type: row_count
    min: 100
    severity: error
Example output:
PASS row_count: Row count 5432 within bounds (12ms)
FAIL row_count (error): count 50 < min 100 (8ms)
5. custom_sql - Custom SQL validation¶
Execute custom SQL queries that return failing rows. The audit passes if the query returns zero rows.
Parameters:
- sql (required): SQL query that returns failing rows
- severity (optional): error or warn (default: error)
- Special placeholder: __THIS__ is replaced with the current table name
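To make the "zero failing rows means pass" rule and the `__THIS__` substitution concrete, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in engine. Seeknal runs audits against cached Parquet files, so the engine, the `run_custom_sql_audit` helper, and the sample table are all assumptions made for illustration.

```python
import sqlite3


def run_custom_sql_audit(conn, table_name, sql):
    """Substitute __THIS__ and return (passed, failing_row_count).

    Hypothetical helper: the audit passes only when the query returns no rows.
    """
    query = sql.replace("__THIS__", table_name)
    failing_rows = conn.execute(query).fetchall()
    return len(failing_rows) == 0, len(failing_rows)


# In-memory table standing in for a cached node output.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clean_orders (order_id INTEGER, order_total REAL, quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO clean_orders VALUES (?, ?, ?)",
    [(1, 20.0, 2), (2, -5.0, 1), (3, 12.5, 0)],
)

audit_sql = "SELECT * FROM __THIS__ WHERE order_total < 0 OR quantity <= 0"
passed, failing_count = run_custom_sql_audit(conn, "clean_orders", audit_sql)
print(passed, failing_count)
```

Writing the query so that it selects the bad rows, rather than counting good ones, means the same result set can be inspected directly when the audit fails.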
Example - check business logic:
audits:
  - type: custom_sql
    sql: |
      SELECT * FROM __THIS__
      WHERE order_total < 0
         OR discount_percent > 100
         OR quantity <= 0
    severity: error
Example - referential integrity:
audits:
  - type: custom_sql
    sql: |
      SELECT o.*
      FROM __THIS__ o
      LEFT JOIN source.customers c ON o.customer_id = c.customer_id
      WHERE c.customer_id IS NULL
    severity: warn
Example output:
Complete Example¶
kind: transform
name: clean_orders
description: "Cleaned and validated order data"
transform: |
  SELECT
    order_id,
    customer_id,
    order_date,
    order_status,
    order_total,
    CASE
      WHEN order_status IN ('shipped', 'delivered') THEN 'fulfilled'
      WHEN order_status = 'cancelled' THEN 'cancelled'
      ELSE 'pending'
    END as fulfillment_status
  FROM source.raw_orders
  WHERE order_date >= '2024-01-01'
inputs:
  - ref: source.raw_orders
audits:
  # Primary key check
  - type: not_null
    columns: [order_id]
    severity: error
  # Uniqueness check
  - type: unique
    columns: [order_id]
    severity: error
  # Required fields
  - type: not_null
    columns: [customer_id, order_date]
    severity: error
  # Enum validation
  - type: accepted_values
    columns: [order_status]
    values: [pending, confirmed, shipped, delivered, cancelled]
    severity: warn
  # Sanity checks on data volume
  - type: row_count
    min: 1
    severity: error
  # Business logic validation
  - type: custom_sql
    sql: |
      SELECT * FROM __THIS__
      WHERE order_total < 0
         OR order_date > CURRENT_DATE
    severity: error
Running Audits¶
Audit all nodes¶
Running seeknal audit runs the audits on all nodes that have audits: defined. It scans target/cache/ for Parquet files and executes the configured audit rules.
Example output:
Auditing source.users:
PASS not_null [user_id, email]: All values non-null (45ms)
PASS unique [user_id]: All values unique (72ms)
Auditing transform.clean_orders:
PASS not_null [order_id]: All values non-null (23ms)
PASS unique [order_id]: All values unique (67ms)
FAIL accepted_values [order_status] (warn): 3 rows with invalid values (41ms)
PASS row_count: Row count 1543 within bounds (8ms)
Audit Summary: 5 passed, 1 failed
Audit specific node¶
Runs audits only on the specified node. Use the fully qualified node name: <kind>.<name>.
Common workflow¶
# 1. Run the pipeline
seeknal run
# 2. Run audits on cached outputs
seeknal audit
# 3. If audits fail, inspect the data in the REPL
seeknal repl
# Then run: SELECT * FROM transform.clean_orders LIMIT 10
Audit Output Format¶
Audit results show:
- Status: PASS (green) or FAIL (red)
- Audit type: The audit rule type
- Columns: Affected columns (if applicable)
- Severity: (error) or (warn) shown only on failures
- Message: Details about the failure
- Duration: Execution time in milliseconds
Exit codes:
- 0 - All audits passed
- 1 - One or more audits failed (regardless of severity)
Note: Both severity: error and severity: warn failures contribute to the exit code. The severity level is displayed in the output to help you distinguish between critical and non-critical issues.
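The exit-code rule is simple enough to state as code. The following is a sketch of the documented behavior only, with hypothetical names (`AuditResult`, `audit_exit_code`, `failures_by_severity`); it also shows how a wrapper script might separate error and warn failures, since the exit code alone does not.

```python
from dataclasses import dataclass


@dataclass
class AuditResult:
    """Hypothetical record of one audit outcome (not a Seeknal class)."""
    audit_type: str
    passed: bool
    severity: str = "error"


def audit_exit_code(results):
    """Return 1 if any audit failed, regardless of severity; otherwise 0."""
    return 1 if any(not r.passed for r in results) else 0


def failures_by_severity(results):
    """Count failed audits per severity level."""
    summary = {"error": 0, "warn": 0}
    for r in results:
        if not r.passed:
            summary[r.severity] += 1
    return summary


results = [
    AuditResult("not_null", passed=True),
    AuditResult("unique", passed=True),
    AuditResult("accepted_values", passed=False, severity="warn"),
]
print(audit_exit_code(results), failures_by_severity(results))
```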
CI/CD Integration¶
Add audits to your CI/CD pipeline:
# .github/workflows/data-quality.yml
name: Data Quality Checks
on: [push, pull_request]
jobs:
  audit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install Seeknal
        run: pip install seeknal
      - name: Run pipeline
        run: seeknal run
      - name: Run audits
        run: seeknal audit
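Outside GitHub Actions, the same two-step gate can be scripted. This is a minimal sketch that assumes only the seeknal run and seeknal audit commands shown above; `run_quality_gate` and its injectable `runner` argument are hypothetical conveniences for this example.

```python
import subprocess


def run_quality_gate(runner=subprocess.run):
    """Run the pipeline, then the audits; return the first non-zero exit code.

    `runner` defaults to subprocess.run and is injectable so the function can
    be exercised without Seeknal installed.
    """
    for command in (["seeknal", "run"], ["seeknal", "audit"]):
        completed = runner(command)
        if completed.returncode != 0:
            return completed.returncode
    return 0
```

Calling `raise SystemExit(run_quality_gate())` at the end of a script passes the result on to the CI runner, so a failed audit fails the job.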
Feature Validation (seeknal validate-features)¶
What Is Feature Validation?¶
Feature validation provides development-time quality checks for feature groups. Unlike audits (which run on pipeline outputs), feature validation runs on feature group DataFrames using configurable validators.
Key characteristics:
- Runs on PySpark DataFrames during feature group development
- Configured programmatically via ValidationConfig
- Supports complex validators (null, range, uniqueness, freshness, custom)
- Two modes: warn (log failures) or fail (raise exception)
- Run via seeknal validate-features command
When to use feature validation:
- Feature engineering data quality checks
- ML pipeline validation before model training
- Development-time debugging of data issues
- Ensuring feature group schema compliance
Built-in Validators¶
Seeknal provides five built-in validators in seeknal.feature_validation.validators:
1. NullValidator - Detect NULL values¶
Checks columns for null values with configurable threshold.
from seeknal.feature_validation.validators import NullValidator
# No nulls allowed (default)
validator = NullValidator(columns=["user_id", "email"])
# Allow up to 5% nulls
validator = NullValidator(
    columns=["age", "phone"],
    max_null_percentage=0.05,
)
Parameters:
- columns (List[str]): Columns to check
- max_null_percentage (float): Maximum allowed null percentage (0.0-1.0, default: 0.0)
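Note that max_null_percentage is a fraction between 0.0 and 1.0, so 5% is written as 0.05. The sketch below shows the threshold arithmetic in plain Python rather than PySpark. The helper names are hypothetical, and treating a null share exactly equal to the threshold as passing is an assumption based on the word "maximum".

```python
def null_fraction(rows, column):
    """Share of rows (0.0-1.0) whose `column` is None; 0.0 for an empty input."""
    if not rows:
        return 0.0
    nulls = sum(1 for row in rows if row.get(column) is None)
    return nulls / len(rows)


def passes_null_check(rows, column, max_null_percentage=0.0):
    """Pass when the null share does not exceed the allowed maximum (assumed inclusive)."""
    return null_fraction(rows, column) <= max_null_percentage


# Hypothetical sample: 1 null out of 20 rows, i.e. a 5% null share.
rows = [{"age": 30}] * 19 + [{"age": None}]
print(null_fraction(rows, "age"))
print(passes_null_check(rows, "age", max_null_percentage=0.05))
print(passes_null_check(rows, "age"))
```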
2. RangeValidator - Validate numeric bounds¶
Checks that numeric values fall within min/max bounds.
from seeknal.feature_validation.validators import RangeValidator
# Age must be between 0 and 120
validator = RangeValidator(
    column="age",
    min_val=0,
    max_val=120,
)
# Price must be non-negative (min only)
validator = RangeValidator(column="price", min_val=0)
# Percentage must not exceed 100 (max only)
validator = RangeValidator(column="discount_percent", max_val=100)
Parameters:
- column (str): Column to validate
- min_val (float, optional): Minimum value (inclusive)
- max_val (float, optional): Maximum value (inclusive)
- Note: At least one of min_val or max_val required
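Because both bounds are inclusive, a value equal to min_val or max_val passes. The plain-Python sketch below illustrates that rule. `out_of_range` is a hypothetical helper, and skipping nulls is an assumption drawn from the "non-null values" wording in the CLI output later in this guide.

```python
def out_of_range(values, min_val=None, max_val=None):
    """Return the values that violate the inclusive [min_val, max_val] bounds."""
    if min_val is None and max_val is None:
        raise ValueError("at least one of min_val or max_val is required")
    violations = []
    for value in values:
        if value is None:
            continue  # assumption: nulls are ignored by the range check
        if min_val is not None and value < min_val:
            violations.append(value)
        elif max_val is not None and value > max_val:
            violations.append(value)
    return violations


ages = [0, 35, 120, 121, -1, None]
print(out_of_range(ages, min_val=0, max_val=120))
```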
3. UniquenessValidator - Detect duplicates¶
Checks for duplicate rows based on specified columns.
from seeknal.feature_validation.validators import UniquenessValidator
# Strict uniqueness (no duplicates)
validator = UniquenessValidator(columns=["user_id"])
# Allow up to 1% duplicates
validator = UniquenessValidator(
    columns=["user_id", "event_date"],
    max_duplicate_percentage=0.01,
)
Parameters:
- columns (List[str]): Columns defining uniqueness (composite key)
- max_duplicate_percentage (float): Maximum allowed duplicate percentage (0.0-1.0, default: 0.0)
4. FreshnessValidator - Check timestamp recency¶
Validates that timestamps are within a maximum age from current time.
from datetime import datetime, timedelta
from seeknal.feature_validation.validators import FreshnessValidator
# Timestamps must be within last 24 hours
validator = FreshnessValidator(
    column="event_time",
    max_age=timedelta(hours=24),
)
# With custom reference time
validator = FreshnessValidator(
    column="updated_at",
    max_age=timedelta(days=7),
    reference_time=datetime(2024, 1, 1, 12, 0, 0),
)
Parameters:
- column (str): Timestamp column to check
- max_age (timedelta): Maximum allowed age
- reference_time (datetime, optional): Reference time (default: current time)
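The freshness rule compares each timestamp's age, measured from the reference time, against max_age. Here is a minimal sketch of that comparison using only the standard library. `stale_timestamps` is a hypothetical helper, and treating a timestamp that is exactly max_age old as still fresh is an assumption.

```python
from datetime import datetime, timedelta


def stale_timestamps(timestamps, max_age, reference_time=None):
    """Return the timestamps older than `max_age` relative to `reference_time`."""
    if reference_time is None:
        reference_time = datetime.now()
    return [ts for ts in timestamps if reference_time - ts > max_age]


reference = datetime(2024, 1, 1, 12, 0, 0)
events = [
    datetime(2024, 1, 1, 9, 0, 0),     # 3 hours old
    datetime(2023, 12, 31, 12, 0, 0),  # exactly 24 hours old
    datetime(2023, 12, 30, 8, 0, 0),   # 52 hours old
]
stale = stale_timestamps(events, max_age=timedelta(hours=24), reference_time=reference)
print(stale)
```

Pinning reference_time, as in the second FreshnessValidator example above, makes the check reproducible when validating historical snapshots.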
5. CustomValidator - User-defined validation¶
Wrap custom validation functions for flexible validation logic.
from pyspark.sql import DataFrame, functions as F
from seeknal.feature_validation.models import ValidationResult
from seeknal.feature_validation.validators import CustomValidator
# Simple boolean function
def check_positive_values(df: DataFrame) -> bool:
    return df.filter(F.col("value") < 0).count() == 0
validator = CustomValidator(
    func=check_positive_values,
    name="positive_values_check",
)
# Lambda function
validator = CustomValidator(
    func=lambda df: df.count() > 0,
    name="non_empty_check",
)
# Function returning ValidationResult for detailed control
def check_with_details(df: DataFrame) -> ValidationResult:
    count = df.filter(F.col("status") == "error").count()
    total = df.count()
    return ValidationResult(
        validator_name="error_check",
        passed=count == 0,
        failure_count=count,
        total_count=total,
        message=f"Found {count} error records",
    )
validator = CustomValidator(
    func=check_with_details,
    name="error_check",
    description="Check for error status records",
)
Parameters:
- func (Callable): Validation function returning bool or ValidationResult
- name (str): Validator name (default: "CustomValidator")
- description (str, optional): Human-readable description
Configuring Feature Validation¶
Feature validation is configured programmatically via ValidationConfig:
from seeknal.featurestore.feature_group import FeatureGroup
from seeknal.feature_validation.models import ValidationConfig, ValidationMode
from seeknal.feature_validation.validators import (
    NullValidator,
    RangeValidator,
    UniquenessValidator,
)
# Create feature group
fg = FeatureGroup.load("user_features")
# Configure validators
validation_config = ValidationConfig(
    mode=ValidationMode.FAIL,  # or ValidationMode.WARN
    validators=[
        NullValidator(columns=["user_id", "signup_date"]),
        UniquenessValidator(columns=["user_id"]),
        RangeValidator(column="age", min_val=0, max_val=120),
        RangeValidator(column="ltv_score", min_val=0, max_val=100),
    ],
    enabled=True,
)
# Attach to feature group
fg.set_validation_config(validation_config)
fg.save()
Validation Modes¶
Feature validation supports two modes:
| Mode | Behavior | Exit Code | Use Case |
|---|---|---|---|
| WARN | Log failures but continue execution | 0 | Development, exploratory analysis |
| FAIL | Raise exception on first failure | 1 | Production, CI/CD pipelines |
Example - WARN mode:
from seeknal.feature_validation.models import ValidationConfig, ValidationMode
validation_config = ValidationConfig(
    mode=ValidationMode.WARN,
    validators=[...],
)
All validators execute even if some fail. Failures are logged but don't halt execution.
Example - FAIL mode (default):
from seeknal.feature_validation.models import ValidationConfig, ValidationMode
validation_config = ValidationConfig(
    mode=ValidationMode.FAIL,  # Default
    validators=[...],
)
Execution stops at the first validator failure, raising ValidationException.
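The difference between the two modes comes down to control flow: WARN keeps going and collects every result, while FAIL aborts at the first failure. The sketch below mimics that behavior with plain callables. It does not use Seeknal's classes; `run_validators` and `ValidationFailed` are hypothetical stand-ins for illustration.

```python
class ValidationFailed(Exception):
    """Hypothetical stand-in for the exception raised in FAIL mode."""


def run_validators(validators, data, mode="fail"):
    """Run (name, check) pairs in order and return [(name, passed), ...].

    In "warn" mode every validator runs and failures are only printed;
    in "fail" mode the first failure raises and later validators never run.
    """
    results = []
    for name, check in validators:
        passed = bool(check(data))
        results.append((name, passed))
        if not passed and mode == "fail":
            raise ValidationFailed(f"{name} failed")
        if not passed:
            print(f"WARN: {name} failed")
    return results


checks = [
    ("non_empty", lambda rows: len(rows) > 0),
    ("no_negatives", lambda rows: all(r >= 0 for r in rows)),
    ("max_100", lambda rows: all(r <= 100 for r in rows)),
]
data = [5, -3, 250]

warn_results = run_validators(checks, data, mode="warn")
try:
    run_validators(checks, data, mode="fail")
    fail_message = None
except ValidationFailed as exc:
    fail_message = str(exc)
print(warn_results, fail_message)
```

In this sample, WARN mode reports both failing checks, whereas FAIL mode surfaces only the first one, which is why WARN tends to be more useful while exploring a new dataset.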
Running Feature Validation¶
Via CLI¶
# Run in FAIL mode (default)
seeknal validate-features user_features
# Run in WARN mode
seeknal validate-features user_features --mode warn
# Verbose output
seeknal validate-features user_features --mode fail --verbose
Example output:
Validating feature group: user_features
Mode: fail
Validators to run: 4
- NullValidator
- UniquenessValidator
- RangeValidator
- RangeValidator
Running validators...
PASS NullValidator - Null check passed for columns ['user_id', 'signup_date']
PASS UniquenessValidator - All 5432 rows are unique
PASS RangeValidator - All 5432 non-null values within range [0, 120]
FAIL RangeValidator - 23 values (0.42%) outside range [0, 100]
Validation stopped: Range check failed for column 'ltv_score'
Programmatically¶
from seeknal.featurestore.feature_group import FeatureGroup
# Assumption: ValidationException is importable from the same module as
# ValidationMode; adjust the path if your Seeknal version defines it elsewhere.
from seeknal.feature_validation.models import ValidationException, ValidationMode
from seeknal.feature_validation.validators import (
    NullValidator,
    UniquenessValidator,
)
# Load feature group
fg = FeatureGroup.load("user_features")
# Create validators
validators = [
    NullValidator(columns=["user_id"]),
    UniquenessValidator(columns=["user_id"]),
]
# Run validation
try:
    summary = fg.validate(validators=validators, mode=ValidationMode.FAIL)
    print(f"Validation passed: {summary.passed_count}/{summary.total_validators}")
except ValidationException as e:
    print(f"Validation failed: {e}")
YAML Tests Block¶
Overview¶
In addition to inline audits:, Seeknal supports a simplified tests: block in YAML for basic data quality checks on feature groups and aggregations. This provides a dbt-like experience.
Supported in:
- Feature groups
- Aggregations
Syntax:
kind: feature_group
name: user_features
# ... other configuration
tests:
  - not_null: [user_id, signup_date]
  - unique: [user_id]
Supported Test Types¶
not_null - Verify no NULL values¶
Checks that the specified columns contain no NULL values.
unique - Verify uniqueness¶
Checks that the specified columns are unique (no duplicates). When several columns are listed, they are checked together as a composite key.
Complete Example¶
kind: feature_group
name: customer_lifetime_features
description: "Customer lifetime value features"
entity:
  name: customer
  join_keys: [customer_id]
materialization:
  event_time_col: last_order_date
  offline:
    enabled: true
transform: |
  SELECT
    customer_id,
    MIN(order_date) as first_order_date,
    MAX(order_date) as last_order_date,
    COUNT(*) as order_count,
    SUM(order_total) as lifetime_value
  FROM source.raw_orders
  GROUP BY customer_id
inputs:
  - ref: source.raw_orders
features:
  first_order_date:
    description: "Date of first order"
    dtype: date
  last_order_date:
    description: "Date of most recent order"
    dtype: date
  order_count:
    description: "Total number of orders"
    dtype: int
  lifetime_value:
    description: "Total spend across all orders"
    dtype: float
tests:
  - not_null: [customer_id]
  - unique: [customer_id]
Comparison to Other Tools¶
| Feature | Seeknal | dbt | SQLMesh |
|---|---|---|---|
| Audit Definition | audits: YAML block | tests: in schema.yml | audits: in model SQL |
| Built-in Tests | 5 types (not_null, unique, accepted_values, row_count, custom_sql) | 4 types (not_null, unique, accepted_values, relationships) | Similar to Seeknal |
| Custom SQL Tests | custom_sql audit type | Custom test macros via dbt test | audit with SQL expressions |
| Runtime Checks | seeknal audit | dbt test | sqlmesh audit |
| Severity Levels | error, warn | error, warn | error |
| Execution | Runs on cached Parquet files | Queries already-built models in the warehouse | Runs on materialized tables |
| Feature Validation | Built-in validators for PySpark DataFrames | Not built-in | Not built-in |
| Validation Modes | warn, fail | N/A | N/A |
| CLI Command | seeknal audit, seeknal validate-features | dbt test | sqlmesh audit |
Best Practices¶
1. Use Audits for Fast Quality Gates¶
Audits run against cached data without re-executing the pipeline, which makes them well suited to CI/CD pipelines:
audits:
  - type: not_null
    columns: [id, created_at]
    severity: error
  - type: row_count
    min: 1
    severity: error
2. Set Appropriate Severity Levels¶
Use error for critical issues, warn for informational checks:
audits:
  # Critical - must pass
  - type: unique
    columns: [user_id]
    severity: error
  # Informational - flagged as warn in the output (a failure still sets exit code 1)
  - type: accepted_values
    columns: [country_code]
    values: [US, UK, CA, AU, DE, FR]
    severity: warn
3. Combine Multiple Audit Types¶
Layer audits for comprehensive validation:
audits:
  # 1. Schema validation
  - type: not_null
    columns: [order_id, customer_id, order_date]
    severity: error
  # 2. Data integrity
  - type: unique
    columns: [order_id]
    severity: error
  # 3. Business logic
  - type: custom_sql
    sql: |
      SELECT * FROM __THIS__
      WHERE order_total < 0 OR quantity <= 0
    severity: error
  # 4. Volume sanity check
  - type: row_count
    min: 100
    severity: warn
4. Use Feature Validation for ML Pipelines¶
Validate feature groups before training:
validators = [
    NullValidator(columns=["user_id", "features"]),
    RangeValidator(column="age", min_val=0, max_val=120),
    FreshnessValidator(column="updated_at", max_age=timedelta(days=1)),
]
summary = fg.validate(validators=validators, mode=ValidationMode.FAIL)
if summary.passed:
    # Proceed with training
    train_model(fg)
5. Document Custom Audits¶
Add comments to custom SQL audits for maintainability:
audits:
  # Check referential integrity: all orders must have valid customer_id
  - type: custom_sql
    sql: |
      SELECT o.*
      FROM __THIS__ o
      LEFT JOIN source.customers c ON o.customer_id = c.customer_id
      WHERE c.customer_id IS NULL
    severity: error
Troubleshooting¶
Audit fails: "Table does not exist"¶
Cause: No cached data for the node.
Solution: Run the pipeline first with seeknal run so that target/cache/ is populated, then rerun seeknal audit.
Feature validation: "No validators configured"¶
Cause: Feature group has no ValidationConfig set.
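Solution: Configure validators by building a ValidationConfig and attaching it with fg.set_validation_config(...) followed by fg.save(), as described in Configuring Feature Validation.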
Audit passes but data looks wrong¶
Cause: Audits only check what you tell them to check.
Solution: Add more specific custom SQL audits:
audits:
  - type: custom_sql
    sql: |
      SELECT * FROM __THIS__
      WHERE your_specific_condition
    severity: error
Slow audit execution¶
Cause: Large cached Parquet files.
Solution:
- Optimize transforms to output fewer rows
- Use row_count with max to catch runaway queries early
- Consider partitioning large outputs
Summary¶
Seeknal's testing and audits system provides:
- Audits: Production-time quality checks on cached outputs via seeknal audit
- Feature Validation: Development-time quality checks on feature groups via seeknal validate-features
- YAML Tests: Simple dbt-like test syntax for basic checks
- Five audit types: not_null, unique, accepted_values, row_count, custom_sql
- Five validators: NullValidator, RangeValidator, UniquenessValidator, FreshnessValidator, CustomValidator
- Flexible severity: error vs. warn for different failure handling
- CI/CD ready: Fast execution on cached data with clear exit codes
Use audits for fast quality gates in production pipelines, and feature validation for comprehensive ML pipeline data quality checks.
See Also¶
- Concepts: Audit, Validation, Rule
- Reference: CLI Commands, YAML Schema - Audits
- Tutorials: YAML Pipeline Tutorial