Quick Start - Python API¶

Estimated Time: 10 minutes | Difficulty: Beginner | Format: Python

Build your first Seeknal pipeline using Python. This approach is ideal for: - ML Engineers building feature stores with complex logic - Data Scientists who prefer programmatic control - Developers integrating Seeknal into Python applications

Why Python?¶

Python pipelines offer several advantages:

Programmatic Control: Use loops, conditionals, and functions
Complex Logic: Build sophisticated transformations with Python
Testing: Write unit tests for your pipelines
Integration: Embed in Jupyter notebooks or Python applications

Prerequisites¶

Before starting, ensure you have:

Requirement	Version	Check
Python	3.11+	`python --version`
pip	Latest	`pip --version`
pandas	Latest	`pip show pandas`

Python Version Check

# Check your Python version
python --version

Seeknal requires Python 3.11 or higher.

Part 1: Install & Setup (2 minutes)¶

Step 1: Install Seeknal¶

pip install seeknal

# Verify installation
python -c "import seeknal; print('Seeknal installed successfully!')"

Expected output: Seeknal installed successfully!

More Options

For detailed installation instructions, see the Installation Guide.

Step 2: Create Your Project¶

seeknal init --name quickstart-python --description "Python pipeline quickstart"
cd quickstart-python

Part 2: Understand the Python Pipeline Workflow (2 minutes)¶

Seeknal's Python workflow is straightforward:

graph LR
    A[Load Data] --> B[Create Task]
    B --> C[Add SQL]
    C --> D[Transform]
    D --> E[Save Results]

Step	Method	What It Does
1. Load	`pd.read_csv()`	Load data into pandas
2. Create	`DuckDBTask()`	Create transformation task
3. Add	`add_input()` + `add_sql()`	Specify data and SQL
4. Transform	`transform()`	Execute transformation
5. Save	`to_parquet()`	Save results

Why Python?

Use Python when you need: - Complex conditional logic - Dynamic SQL generation - Integration with ML frameworks - Programmatic pipeline construction

Part 3: Create Your Python Pipeline (4 minutes)¶

Step 1: Create Sample Data¶

Create data/sales.csv:

date,product_category,quantity,revenue
2024-01-01,Electronics,5,500.00
2024-01-01,Clothing,10,200.00
2024-01-01,Electronics,3,300.00
2024-01-02,Clothing,8,160.00
2024-01-02,Electronics,2,200.00
2024-01-02,Home & Garden,4,120.00
2024-01-03,Electronics,6,600.00
2024-01-03,Clothing,12,240.00
2024-01-03,Home & Garden,3,90.00

Step 2: Create Your Pipeline Script¶

Create pipeline.py:

#!/usr/bin/env python3
"""
Quick Start - Python Pipeline
Calculates daily revenue by product category
"""

import pandas as pd
import pyarrow as pa
from pathlib import Path
from seeknal.tasks.duckdb import DuckDBTask

# Load the data
print("Loading data...")
df = pd.read_csv("data/sales.csv")
print(f"Loaded {len(df)} rows")

# Create DuckDB task
print("Creating transformation task...")
task = DuckDBTask()

# Add input data (convert pandas to Arrow)
arrow_table = pa.Table.from_pandas(df)
task.add_input(dataframe=arrow_table)

# Define transformation SQL
sql = """
SELECT
    date,
    product_category,
    SUM(quantity) as total_quantity,
    SUM(revenue) as daily_revenue
FROM __THIS__
GROUP BY date, product_category
ORDER BY date, daily_revenue DESC
"""

task.add_sql(sql)

# Execute transformation
print("Executing transformation...")
result_arrow = task.transform()

# Convert back to pandas
result_df = result_arrow.to_pandas()

# Save results
output_dir = Path("output")
output_dir.mkdir(exist_ok=True)
output_path = output_dir / "daily_revenue.parquet"

result_df.to_parquet(output_path, index=False)
print(f"Results saved to: {output_path}")

# Display results
print("\nResults:")
print(result_df.to_string(index=False))

THIS Placeholder

The __THIS__ placeholder automatically references your input data. No complex table names needed.

Part 4: Run and Verify (2 minutes)¶

Step 1: Execute Your Pipeline¶

python pipeline.py

Expected output:

Loading data...
Loaded 9 rows
Creating transformation task...
Executing transformation...
Results saved to: output/daily_revenue.parquet

Results:
        date product_category  total_quantity  daily_revenue
2024-01-01      Electronics               8         800.00
2024-01-01        Clothing              10         200.00
2024-01-02      Electronics               2         200.00
2024-01-02        Clothing               8         160.00
2024-01-02   Home & Garden               4         120.00
2024-01-03      Electronics               6         600.00
2024-01-03        Clothing              12         240.00
2024-01-03   Home & Garden               3          90.00

Step 2: Verify Output¶

# Check the output file
ls -lh output/daily_revenue.parquet

# View with Python
python -c "import pandas as pd; print(pd.read_parquet('output/daily_revenue.parquet'))"

Congratulations!

You've built a complete data pipeline using Python — programmatic, testable, and flexible.

What Makes Python Great?¶

For ML Engineers - Feature Engineering¶

# Dynamic feature generation
def create_feature_groups(df, categories):
    task = DuckDBTask()
    arrow_table = pa.Table.from_pandas(df)
    task.add_input(dataframe=arrow_table)

    for category in categories:
        sql = f"""
        SELECT
            user_id,
            COUNT(*) as {category}_count,
            SUM(revenue) as {category}_revenue
        FROM __THIS__
        WHERE product_category = '{category}'
        GROUP BY user_id
        """
        task.add_sql(sql)

    return task.transform()

For Data Scientists - Conditional Logic¶

# Adaptive transformations
def smart_aggregation(df, metric_type):
    if metric_type == "revenue":
        sql = "SELECT user_id, SUM(revenue) FROM __THIS__ GROUP BY user_id"
    elif metric_type == "engagement":
        sql = "SELECT user_id, AVG(session_duration) FROM __THIS__ GROUP BY user_id"
    else:
        raise ValueError(f"Unknown metric type: {metric_type}")

    task = DuckDBTask()
    task.add_input(dataframe=pa.Table.from_pandas(df))
    task.add_sql(sql)
    return task.transform()

For Developers - Integration¶

# Embed in FastAPI application
from fastapi import FastAPI
from seeknal.tasks.duckdb import DuckDBTask

app = FastAPI()

@app.post("/transform")
async def transform_data(data: dict):
    df = pd.DataFrame(data)
    task = DuckDBTask()
    task.add_input(dataframe=pa.Table.from_pandas(df))
    task.add_sql("SELECT * FROM __THIS__ WHERE value > 100")
    result = task.transform()
    return result.to_pandas().to_dict(orient="records")

When to Use Python vs YAML¶

Use Python When...	Use YAML When...
Need complex logic	Want simple, declarative pipelines
Building ML features	Creating standard ETL/ELT jobs
Dynamic SQL generation	Prefer version-controlled configs
Integrating with Python apps	Working with non-technical stakeholders
Writing unit tests	Want git-friendly diffs

Pro Tip

You can mix both! Use YAML for standard pipelines and Python for complex feature engineering. Both approaches work with the same Seeknal engine.

What's Next?¶

Choose your learning path:

Data Engineer	Analytics Engineer	ML Engineer
Python ELT Pipelines	Semantic Models in Python	Feature Store API

Troubleshooting¶

Common Issues

Problem: ModuleNotFoundError: No module named 'seeknal' - Activate your virtual environment: source .venv/bin/activate - Verify installation: pip show seeknal

Problem: ArrowError: Column type mismatch - Check that pandas DataFrame dtypes match your expectations - Use df.dtypes to inspect column types

Problem: SQL syntax error - Verify SQL syntax is valid - Check column names match input DataFrame

Full Troubleshooting Guide →