Getting Started with Seeknal¶
Estimated Time: 30 minutes | Difficulty: Beginner
This comprehensive guide takes you from installation to your first materialized feature group. By the end, you'll understand how Seeknal simplifies feature engineering for ML applications.
Table of Contents¶
- Why Seeknal?
- What You'll Learn
- Prerequisites
- Installation
- Development Approaches
- Quick Start Tutorial
- Part 1: Load and Explore Data
- Part 2: Feature Engineering
- Part 3: Save and Analyze Features
- Core Concepts
- DuckDB vs Spark: When to Use Each
- Next Steps
- Troubleshooting
Why Seeknal?¶
Seeknal is an all-in-one platform for data and AI/ML engineering that abstracts away the complexity of data transformation and feature engineering.
Key Benefits¶
| Feature | Benefit |
|---|---|
| Simple Setup | Uses SQLite for metadata - no external database required |
| Dual Engine Support | DuckDB for development, Spark for production |
| Pythonic API | Define transformations with familiar Python and SQL |
| Feature Reuse | Register and share features across teams |
| Point-in-Time Correctness | Automatic handling to prevent data leakage |
How Seeknal Compares¶
Unlike other feature stores that require complex infrastructure setup, Seeknal:
- Starts simple: SQLite metadata + DuckDB processing = zero infrastructure
- Scales when needed: Same code works with Spark for production
- Stays flexible: Mix and match processing engines as needed
What You'll Learn¶
By completing this guide, you will:
- Install Seeknal and initialize a project
- Define data sources with YAML pipeline definitions
- Create transforms using SQL-based data transformations
- Build feature groups for ML-ready feature engineering
- Run the pipeline with the
draft → dry-run → apply → runworkflow - Explore results using the interactive REPL
Note: This tutorial uses Seeknal's YAML pipeline workflow — a dbt-inspired approach with CLI commands. For Python decorator-based pipelines, see the Python Pipelines Tutorial.
Prerequisites¶
Before starting, ensure you have:
| Requirement | Version | Check Command |
|---|---|---|
| Python | 3.11 or higher | python --version |
| pip | Latest | pip --version |
| uv (recommended) | Latest | uv --version |
Optional but Recommended¶
- Jupyter Notebook - For interactive exploration
- VS Code or PyCharm - For running Python scripts
Installation¶
Estimated Time: 5 minutes
Choose your preferred installation method based on your operating system.
Install from PyPI¶
This installs Seeknal and all required dependencies (DuckDB, pandas, pyarrow, etc.).
Verify Installation¶
Run this command to verify Seeknal is installed correctly:
Expected Output:
Configure Environment (Optional)¶
Seeknal works out of the box with sensible defaults. Configuration is not required for this tutorial.
For advanced usage (custom config location, database backends, or team environments), you can override the default paths:
# Optional: Override default config directory (default: ~/.seeknal/)
export SEEKNAL_BASE_CONFIG_PATH="$HOME/.seeknal"
# Optional: Override user config file (default: ~/.seeknal/config.toml)
export SEEKNAL_USER_CONFIG_PATH="$HOME/.seeknal/config.toml"
When would you need this?
- Custom install paths — if your home directory is not writable or you want config in a shared location
- Multiple environments — separate configs for dev/staging/prod via different
SEEKNAL_BASE_CONFIG_PATHvalues - CI/CD pipelines — point to ephemeral config directories
Seeknal auto-creates ~/.seeknal/ and its SQLite database on first use, so most users never need to set these.
Quick Start Tutorial¶
Total Estimated Time: 15 minutes
In this tutorial, you'll build a data pipeline that loads CSV data, cleans it, and creates user features — all using YAML definitions and the CLI.
Step 1: Initialize a Project (2 minutes)¶
seeknal init --name ecommerce_analytics --description "E-commerce user features"
cd ecommerce_analytics
This creates the project structure:
ecommerce_analytics/
├── seeknal_project.yml # Project configuration
├── profiles.yml # Connection credentials (gitignored)
├── seeknal/
│ ├── sources/ # Data source definitions
│ ├── transforms/ # SQL transformations
│ └── feature_groups/ # Feature engineering
└── target/ # Pipeline outputs
Step 2: Create a Data Source (3 minutes)¶
Generate a source template:
Edit the generated draft_source_raw_orders.yml:
kind: source
name: raw_orders
description: "Raw e-commerce order data"
tags: ["orders", "ecommerce"]
source: csv
table: "data/orders.csv"
columns:
order_id: "Unique order identifier"
customer_id: "Customer who placed the order"
order_date: "Date the order was placed"
product: "Product name"
quantity: "Number of items"
price: "Unit price"
total: "Order total amount"
Validate and preview:
Apply to register the source:
Step 3: Create a Transform (3 minutes)¶
Generate a transform template:
Edit draft_transform_clean_orders.yml:
kind: transform
name: clean_orders
description: "Remove invalid orders and add computed columns"
input:
- ref('source.raw_orders')
query: |
SELECT
order_id,
customer_id,
CAST(order_date AS DATE) AS order_date,
product,
quantity,
price,
total,
(quantity * price) AS computed_total
FROM input_0
WHERE
quantity > 0
AND price > 0
AND customer_id IS NOT NULL
Apply it:
Step 4: Create a Feature Group (3 minutes)¶
Generate a feature group template:
Edit draft_feature_group_customer_features.yml:
kind: feature_group
name: customer_features
description: "Customer-level purchase behavior features"
entity: customer
entity_keys:
- customer_id
input:
- ref('transform.clean_orders')
query: |
SELECT
customer_id,
COUNT(DISTINCT order_id) AS order_count,
SUM(total) AS total_spend,
AVG(total) AS avg_order_value,
MAX(total) AS max_order_value,
MIN(order_date) AS first_order_date,
MAX(order_date) AS last_order_date,
COUNT(DISTINCT product) AS unique_products
FROM input_0
GROUP BY customer_id
materialization:
format: parquet
Apply it:
seeknal dry-run draft_feature_group_customer_features.yml
seeknal apply draft_feature_group_customer_features.yml
Step 5: Run the Pipeline (2 minutes)¶
Preview the execution plan, then run:
Expected output:
Seeknal Pipeline Execution
============================================================
ℹ Building DAG from seeknal/ directory...
✓ DAG built: 3 nodes, 2 edges
Execution Plan:
1. RUN raw_orders [orders, ecommerce]
2. RUN clean_orders
3. RUN customer_features
Execution
============================================================
1/3: raw_orders [RUNNING]
SUCCESS in 0.5s — Rows: 1,000
2/3: clean_orders [RUNNING]
SUCCESS in 0.3s — Rows: 985
3/3: customer_features [RUNNING]
SUCCESS in 0.4s — Rows: 200
✓ All nodes executed successfully
Step 6: Explore Results (2 minutes)¶
Use the interactive REPL to query your pipeline outputs:
-- The REPL auto-registers all pipeline outputs
-- Query your feature group directly:
SELECT * FROM customer_features ORDER BY total_spend DESC LIMIT 5;
-- Explore the intermediate transform:
SELECT * FROM clean_orders LIMIT 10;
-- Run ad-hoc analysis:
SELECT
CASE
WHEN order_count = 1 THEN 'One-time'
WHEN order_count <= 3 THEN 'Occasional'
ELSE 'Frequent'
END AS segment,
COUNT(*) AS customer_count,
ROUND(AVG(total_spend), 2) AS avg_spend
FROM customer_features
GROUP BY 1
ORDER BY customer_count DESC;
Output files are in target/intermediate/:
ls target/intermediate/
# feature_group_customer_features.parquet
# transform_clean_orders.parquet
# source_raw_orders.parquet
Core Concepts¶
Pipeline Nodes¶
Seeknal pipelines are built from nodes — YAML or Python files that define each step of your data workflow:
| Node Type | Purpose | Example |
|---|---|---|
| Source | Load raw data (CSV, PostgreSQL, etc.) | kind: source |
| Transform | SQL-based data transformations | kind: transform |
| Feature Group | ML feature engineering with entity keys | kind: feature_group |
| Model | ML model training/inference | kind: model |
| Exposure | Export results to external systems | kind: exposure |
DAG (Directed Acyclic Graph)¶
Nodes are connected via ref() references, forming a DAG that Seeknal executes in dependency order:
CLI Workflow¶
The core workflow follows a dbt-inspired pattern:
seeknal draft <type> <name> # Generate a YAML template
seeknal dry-run <file> # Validate and preview
seeknal apply <file> # Register the node
seeknal run # Execute the full DAG
seeknal repl # Query results interactively
Materialization¶
Pipeline outputs are saved as Parquet files by default. You can also materialize to:
- PostgreSQL —
type: postgresqlwith modes:full,incremental_by_time,upsert_by_key - Apache Iceberg —
type: icebergfor lakehouse storage with time travel
Environments¶
Isolate dev/staging/prod with environment-aware execution:
seeknal env plan dev # Create a dev environment plan
seeknal env apply dev # Execute with environment isolation
Using Iceberg Storage¶
Apache Iceberg provides ACID transactions, time travel, and cloud-native storage for feature groups. With Iceberg, you can store features in S3, GCS, or Azure Blob storage while maintaining full version history and rollback capabilities.
Prerequisites¶
For Iceberg storage, you need: - DuckDB with Iceberg extension (included with Seeknal) - REST Catalog (e.g., Lakekeeper, Hive Metastore) - Cloud storage (S3, GCS, Azure) or local filesystem
Configure Catalog¶
Create or update ~/.seeknal/profiles.yml:
materialization:
catalog:
uri: http://localhost:8181 # Lakekeeper REST catalog
warehouse: s3://my-bucket/warehouse
bearer_token: optional_token # If auth required
Or use environment variables:
Create Feature Group with Iceberg¶
from seeknal.featurestore import (
FeatureGroup,
Materialization,
OfflineMaterialization,
OfflineStore,
OfflineStoreEnum,
IcebergStoreOutput,
)
from seeknal.entity import Entity
from datetime import datetime
import pandas as pd
# Create sample data
df = pd.DataFrame({
"customer_id": ["A001", "A002", "A003"],
"event_date": [datetime(2024, 1, 1)] * 3,
"total_orders": [5, 10, 3],
"total_spend": [100.0, 250.0, 75.0],
})
# Create entity
customer_entity = Entity(name="customer", join_keys=["customer_id"])
# Create feature group with Iceberg storage
fg = FeatureGroup(
name="customer_features",
entity=customer_entity,
materialization=Materialization(
event_time_col="event_date",
offline=True,
offline_materialization=OfflineMaterialization(
store=OfflineStore(
kind=OfflineStoreEnum.ICEBERG, # Use Iceberg!
value=IcebergStoreOutput(
catalog="lakekeeper", # Catalog name from profiles.yml
namespace="prod", # Iceberg namespace (database)
table="customer_features", # Table name
mode="append" # "append" or "overwrite"
)
),
mode="append"
)
)
)
# Write features to Iceberg
fg.set_dataframe(df).set_features()
fg.write(feature_start_time=datetime(2024, 1, 1))
# Features are now in Iceberg table: lakekeeper.prod.customer_features
Query Iceberg Tables¶
After writing, query features with DuckDB or Spark:
import duckdb
# Query Iceberg table directly
con = duckdb.connect(":memory:")
# Load Iceberg extension
con.install_extension("iceberg")
con.load_extension("iceberg")
# Attach REST catalog
con.execute(f"""
ATTACH 'http://localhost:8181' AS iceberg_catalog (
TYPE iceberg,
WAREHOUSE 's3://my-bucket/warehouse'
)
""")
# Query features
result = con.execute("""
SELECT customer_id, total_orders, total_spend
FROM iceberg_catalog.prod.customer_features
""").fetchall()
print(result) # [('A001', 5, 100.0), ('A002', 10, 250.0), ...]
Time Travel with Iceberg¶
Iceberg supports time travel - query features as of any snapshot:
# Get snapshot ID from write result
result = fg.write(feature_start_time=datetime(2024, 1, 1))
snapshot_id = result["snapshot_id"]
# Query as of specific snapshot
con.execute(f"USE SNAPSHOT '{snapshot_id}'")
historical_features = con.execute("""
SELECT * FROM iceberg_catalog.prod.customer_features
""").fetchall()
Write Modes¶
Append Mode (default):
Overwrite Mode:
Benefits of Iceberg Storage¶
| Feature | Benefit |
|---|---|
| ACID Transactions | Atomic writes, no partial data |
| Time Travel | Query features as of any point in time |
| Schema Evolution | Add/modify features without rewrites |
| Cloud Storage | S3, GCS, Azure Blob support |
| Compatibility | Works with DuckDB, Spark, Trino, and more |
| Partitioning | Efficient data pruning for large datasets |
Next Steps¶
Tutorials¶
- YAML Pipeline Tutorial — Deep dive into the YAML workflow with a complete e-commerce dataset
- Python Pipelines Tutorial — Build pipelines using Python decorators (
@source,@transform,@feature_group) - E-commerce Workflow Tutorial — Full draft/dry-run/apply workflow with 5 tables
Guides¶
- Python Pipelines Guide — Decorator API reference and patterns
- Training to Serving — End-to-end ML feature lifecycle
- Testing and Audits — Data quality validation
- Semantic Layer — Metrics and materialized views
Reference¶
- CLI Reference — All CLI commands and options
- YAML Schema Reference — Complete YAML node specification
- Configuration — Profiles, connections, and settings
Community¶
- GitHub: mta-tech/seeknal
- Issues: Report bugs or request features
- PyPI:
pip install seeknal
Troubleshooting¶
Installation Issues¶
"ModuleNotFoundError: No module named 'seeknal'"¶
Cause: Seeknal is not installed in your active Python environment.
Solution:
# Check which Python you're using
which python
# Ensure virtual environment is activated
source .venv/bin/activate # Linux/macOS
.\.venv\Scripts\activate # Windows
# Reinstall Seeknal
pip install seeknal
"ImportError: cannot import name 'DuckDBTask'"¶
Cause: Incorrect import or outdated Seeknal version.
Solution:
# Verify Seeknal version
pip show seeknal
# Update to latest version if needed
pip install --upgrade seeknal
Runtime Issues¶
"FileNotFoundError" for CSV source¶
Cause: The table: path in your source YAML doesn't point to an existing file.
Solution:
# Check the path in your source YAML
# Use absolute paths or paths relative to the project root
ls -la data/orders.csv
"ref not found" in dry-run¶
Cause: The referenced upstream node hasn't been applied yet.
Solution:
# Apply nodes in dependency order
seeknal apply draft_source_raw_orders.yml # Apply source first
seeknal apply draft_transform_clean_orders.yml # Then transform
Getting Help¶
If you're still stuck:
- Check the logs — Look for error messages in the console
- Use
seeknal docs— Search built-in documentation from the CLI - Search GitHub Issues — Someone may have solved your problem
- Open a new issue — Include your Python version, OS, and error message
Summary¶
Congratulations! You've completed the Seeknal Getting Started guide. You now know how to:
- Install Seeknal from PyPI
- Initialize a project with
seeknal init - Define sources, transforms, and feature groups in YAML
- Use the
draft → dry-run → apply → runworkflow - Query results with
seeknal repl
Ready for more? Try the YAML Pipeline Tutorial for a deeper walkthrough, or the Python Pipelines Tutorial for decorator-based pipelines.
Last updated: February 2026 | Seeknal Documentation