
Glossary

Definitions of key terms used throughout Seeknal documentation, organized alphabetically.


Aggregation

A node type that computes aggregate statistics (sum, count, average, etc.) over grouped data. An aggregation is a first-level summarization: it takes a source or transform as input and produces output summarized at the entity level.

YAML kind: aggregation
See also: Second-Order Aggregation, Transform, YAML Schema Reference


Apply

The act of executing a plan in an environment. After reviewing changes with seeknal plan or seeknal dry-run, use seeknal apply to move validated YAML files to production and update the manifest. This is the final step in the draft-dry-run-apply workflow.

CLI commands: seeknal apply <file.yml>, seeknal env apply
See also: Draft, Dry Run, Plan, Virtual Environments


Audit

The process of reviewing and validating pipeline execution history, data quality, and compliance requirements. Seeknal supports querying historical data for regulatory requirements through time travel capabilities in Iceberg materialization.

See also: Iceberg Materialization, Validation


Cache

Temporary storage of execution results to improve performance. Seeknal caches execution state and can reuse intermediate results when nodes haven't changed based on fingerprint comparison.

See also: Fingerprint, State


Change Categorization

A system for classifying changes to pipelines into three categories: BREAKING (incompatible schema changes), NON_BREAKING (compatible additions), and METADATA (non-functional changes). This helps teams understand the impact of changes before applying them.

Categories: BREAKING, NON_BREAKING, METADATA
See also: Change Categorization Guide, Dry Run


DAG

Directed Acyclic Graph - a graph structure where nodes (sources, transforms, feature groups) are connected by directed edges with no cycles. Seeknal uses DAGs to represent pipeline dependencies and determine execution order through topological sorting.

See also: Node, Topological Layer, Manifest
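
The ordering described above can be illustrated with a generic topological sort (Kahn's algorithm). This is a conceptual sketch, not Seeknal's actual implementation; the node names are made up.

```python
from collections import defaultdict, deque

def topological_order(nodes, deps):
    """Return nodes in dependency order using Kahn's algorithm.

    `deps` maps each node to the set of nodes it depends on.
    Raises ValueError if the graph contains a cycle.
    """
    indegree = {n: len(deps.get(n, ())) for n in nodes}
    dependents = defaultdict(list)  # node -> nodes that depend on it
    for node, node_deps in deps.items():
        for dep in node_deps:
            dependents[dep].append(node)

    ready = deque(n for n, d in indegree.items() if d == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for dependent in dependents[node]:
            indegree[dependent] -= 1
            if indegree[dependent] == 0:
                ready.append(dependent)
    if len(order) != len(nodes):
        raise ValueError("cycle detected in pipeline graph")
    return order

# Example: a transform depends on a source, a feature group on the transform
order = topological_order(
    ["fg.features", "source.customers", "transform.clean"],
    {"transform.clean": {"source.customers"},
     "fg.features": {"transform.clean"}},
)
print(order)  # ['source.customers', 'transform.clean', 'fg.features']
```

Because the graph is acyclic, every node appears after all of its dependencies, which is exactly the property an executor needs.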


Data Leakage

A critical error in ML where future information leaks into training data, producing models that score well in evaluation but fail in production. Seeknal prevents data leakage using point-in-time joins that ensure features are computed only from data available at the time of prediction.

See also: Point-in-Time Join, Feature Start Time


Dimension

In the context of semantic models, a dimension is a categorical attribute used for grouping and filtering data (e.g., customer_country, product_category). Dimensions are used to slice metrics in business intelligence queries.

See also: Metric, Semantic Model


Draft

A template-generated YAML file for creating new pipeline nodes. Created using seeknal draft <type> <name>, draft files follow the naming convention draft_<type>_<name>.yml and must be validated with dry-run before applying.

CLI command: seeknal draft <type> <name>
File naming: draft_<type>_<name>.yml
See also: Dry Run, Apply


Dry Run

A validation and preview step that tests YAML files without applying them to production. Executes the pipeline with a limited row count to catch errors early and preview output before applying changes.

CLI command: seeknal dry-run <file.yml>
Flags: --limit <n>, --timeout <s>, --schema-only
See also: Draft, Apply, Validation


DuckDB

A lightweight, in-process SQL database engine used by Seeknal for data processing. DuckDB is the preferred engine for datasets under 100M rows, offering a pure-Python installation with no JVM requirement and fast performance for single-node deployments.

Storage format: Parquet + JSON metadata
Module: src/seeknal/featurestore/duckdbengine/
See also: Task, Spark Engine


Entity

A domain object with a unique identifier, used as the join key for feature groups. Entities represent business concepts like customers, products, or transactions. Each feature group is associated with one entity.

Python class: Entity
YAML fields: entity.name, entity.join_keys
See also: Feature Group, Join Keys


Environment

An isolated namespace for deploying different versions of pipelines (e.g., dev, staging, prod). Virtual environments enable teams to test changes safely before promoting to production.

CLI commands: seeknal env create, seeknal env apply, seeknal env promote
See also: Virtual Environments, Promote


Exposure

A pipeline output that exposes data to downstream consumers such as APIs, dashboards, or file exports. Exposures document how pipeline data is used and help track data lineage.

YAML kind: exposure
See also: Node, DAG


Feature Group

A container for related ML features with shared entity keys, materialization config, and versioning. Feature groups automatically version on schema changes and support both offline (batch) and online (serving) stores.

YAML kind: feature_group
Python classes: FeatureGroup, FeatureGroupDuckDB
CLI commands: seeknal list feature-groups, seeknal delete feature-group <name>
See also: Entity, Materialization, Offline Store, Online Store
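
A hypothetical YAML sketch may help tie the pieces together. Only the fields documented in this glossary (kind, entity.name, entity.join_keys, materialization and its event_time_col) are real; the name and any other keys are illustrative, and the exact schema should be checked against the YAML Schema Reference.

```yaml
# Illustrative sketch -- field names beyond kind, entity.name,
# entity.join_keys, and materialization.event_time_col are hypothetical
kind: feature_group
name: customer_features
entity:
  name: customer
  join_keys: ["customer_id"]
materialization:
  event_time_col: event_time
```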


Feature Start Time

The timestamp from which features should be materialized. Used in feature group materialization to specify the earliest point in time for feature computation, enabling backfilling and incremental updates.

Python parameter: feature_start_time
See also: Materialization, Point-in-Time Join


Fingerprint

A hash computed from a node's definition (SQL, dependencies, parameters) used for change detection. Seeknal compares fingerprints between runs to determine which nodes need re-execution, enabling incremental execution.

See also: State, Cache, Parallel Execution
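
A minimal sketch of the idea, assuming a node definition is a plain dict; Seeknal's actual hashing scheme may differ. Serializing with sorted keys makes the hash stable across dict orderings, so only genuine definition changes alter the fingerprint.

```python
import hashlib
import json

def node_fingerprint(definition: dict) -> str:
    """Hash a node definition for change detection.

    Canonical JSON (sorted keys, fixed separators) ensures the same
    definition always produces the same hash.
    """
    canonical = json.dumps(definition, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

v1 = node_fingerprint({"sql": "SELECT * FROM orders", "inputs": ["source.orders"]})
v2 = node_fingerprint({"inputs": ["source.orders"], "sql": "SELECT * FROM orders"})
v3 = node_fingerprint({"sql": "SELECT id FROM orders", "inputs": ["source.orders"]})
print(v1 == v2)  # True: key order does not matter
print(v1 == v3)  # False: the SQL changed, so the node must re-execute
```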


Flow

A data transformation pipeline that chains multiple tasks together. Flows can mix Spark and DuckDB tasks, executing them sequentially to transform data from input to output.

Python class: Flow
Definition: Flow(name, input, tasks, output)
See also: Task, DuckDB, Spark Engine


Iceberg Materialization

Persisting pipeline outputs to Apache Iceberg table format with ACID transactions, time travel, and schema evolution. Enables atomic commits, rollback capabilities, and querying historical data snapshots.

Configuration file: ~/.seeknal/profiles.yml
CLI commands: seeknal iceberg validate-materialization, seeknal iceberg snapshot-list
See also: Iceberg Materialization Guide, Materialization


Join Keys

The column(s) used to join features to the target dataset in a feature group. Join keys uniquely identify the entity and are specified in the entity definition.

YAML field: entity.join_keys
Examples: ["customer_id"], ["msisdn", "movement_type"]
See also: Entity, Feature Group


Manifest

A JSON file containing the parsed DAG representation of all pipeline nodes with their dependencies, execution order, and metadata. Generated by seeknal plan or seeknal parse and used by seeknal run for execution.

CLI commands: seeknal plan, seeknal parse
See also: DAG, Plan, Topological Layer


Materialization

The process of computing and persisting features to storage. Includes offline materialization (batch processing for training data) and online materialization (low-latency serving for inference). Supports multiple modes including append and overwrite.

Python classes: Materialization, OfflineMaterialization
YAML field: materialization
Modes: append, overwrite
See also: Feature Group, Offline Store, Online Store, Iceberg Materialization


Metric

A quantitative measure computed from data (e.g., total_revenue, average_order_value). In semantic models, metrics are aggregations of measures that can be sliced by dimensions.

See also: Dimension, Semantic Model


Model

A machine learning model definition that consumes features and produces predictions. Models are nodes in the pipeline DAG and can depend on feature groups and transforms.

YAML kind: model
See also: Feature Group, Node


Node

A single unit in the pipeline DAG representing a source, transform, feature group, aggregation, model, rule, or exposure. Each node has a unique name, type (kind), dependencies (inputs), and execution logic.

Supported kinds: source, transform, feature_group, aggregation, second_order_aggregation, model, rule, exposure
See also: DAG, Topological Layer


Offline Store

Storage backend for batch-computed features used to build training data. Supports file-based storage (Parquet, CSV) and cloud storage (S3, GCS). Optimized for large-scale batch processing, with higher latency than online stores.

Python classes: OfflineStore, OfflineStoreEnum
Storage formats: FILE (Parquet), cloud storage (S3, GCS)
See also: Materialization, Online Store, Feature Group


Online Store

Storage backend for low-latency feature serving used during inference. Optimized for real-time feature retrieval, with TTL (time-to-live) support for feature expiration.

Python classes: OnlineStore, OnlineStoreEnum
Features: low-latency retrieval, TTL support
See also: Materialization, Offline Store, Feature Group


Parallel Execution

The ability to execute independent nodes concurrently within the same topological layer. Seeknal identifies nodes without dependencies and runs them in parallel to reduce total execution time.

See also: Topological Layer, DAG, State
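
A conceptual sketch of layer-by-layer parallel execution, assuming layers have already been computed from the DAG; the executor callable and node names are placeholders, not Seeknal's API.

```python
from concurrent.futures import ThreadPoolExecutor

def run_layers(layers, execute):
    """Run each topological layer in order; nodes within a layer run concurrently.

    `layers` is a list of lists of node names; `execute` runs one node.
    """
    results = {}
    for layer in layers:
        with ThreadPoolExecutor(max_workers=len(layer) or 1) as pool:
            futures = {node: pool.submit(execute, node) for node in layer}
            for node, future in futures.items():
                results[node] = future.result()  # re-raises any node failure
    return results

# Example with a trivial executor: both sources run concurrently,
# then the transform runs once they are done
results = run_layers(
    [["source.a", "source.b"], ["transform.join"]],
    execute=lambda node: f"ran {node}",
)
print(results["transform.join"])  # ran transform.join
```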


Plan

A preview of changes and execution order before running the pipeline. The plan command generates a manifest showing the DAG structure, topological layers, and identifies which nodes will execute based on state comparison.

CLI commands: seeknal plan, seeknal parse
See also: Manifest, Apply, Dry Run


Point-in-Time Join

A temporal join that ensures features are computed only from data available at the prediction time, preventing data leakage. Seeknal uses the event_time_col to perform point-in-time correct joins when materializing historical features.

Configuration: materialization.event_time_col
See also: Point-in-Time Joins Guide, Data Leakage, Materialization
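
The idea can be sketched in a few lines of plain Python: for each label row, pick the latest feature value whose event time is at or before the prediction time. This illustrates the concept only; it is not Seeknal's implementation.

```python
from bisect import bisect_right

def point_in_time_join(labels, features):
    """Attach to each (entity, prediction_time) label the latest feature
    value observed at or before prediction_time.

    `features` maps entity -> list of (event_time, value), sorted by time.
    Events strictly after prediction_time are excluded, preventing leakage.
    """
    joined = []
    for entity, pred_time in labels:
        history = features.get(entity, [])
        times = [t for t, _ in history]
        i = bisect_right(times, pred_time)  # index of first event AFTER pred_time
        value = history[i - 1][1] if i > 0 else None
        joined.append((entity, pred_time, value))
    return joined

features = {"c1": [(1, "bronze"), (5, "gold")]}
labels = [("c1", 3), ("c1", 7), ("c2", 3)]
print(point_in_time_join(labels, features))
# [('c1', 3, 'bronze'), ('c1', 7, 'gold'), ('c2', 3, None)]
```

Note that the label at time 3 sees "bronze", not "gold": the upgrade at time 5 had not happened yet, so using it would be leakage.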


Promote

The act of moving a validated pipeline from one environment to another (e.g., dev to staging to prod). Promotion copies environment-specific configurations and validates compatibility before deployment.

CLI command: seeknal env promote --from dev --to prod
See also: Environment, Virtual Environments, Apply


Python Pipeline

A data transformation pipeline defined in Python using the Flow API or as standalone Python scripts with inline transformations. Python pipelines support custom business logic, ML model training, and external API integration that cannot be expressed in SQL alone.

File location: seeknal/pipelines/*.py
See also: Python Pipelines Guide, Python API vs YAML Workflows, Flow


Rule

A validation or business rule node that checks data quality constraints. Rules can enforce data integrity, completeness, and business logic requirements within the pipeline.

YAML kind: rule
See also: Validation, Node


Second-Order Aggregation

An aggregation that computes summary statistics over first-order aggregations. For example, computing regional totals from user-level metrics. Second-order aggregations support hierarchical rollups and multi-level analytics, enabling you to build features at different levels of granularity without duplicating aggregation logic.

YAML kind: second_order_aggregation
Example use cases: region-level totals from user-level metrics, department rollups from team metrics
See also: Second-Order Aggregations Guide, Aggregation, YAML Pipeline Tutorial
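
The region-totals example above can be sketched in plain Python to show the two levels; the data and column names are made up, and this is conceptual only, not Seeknal's aggregation engine.

```python
from collections import defaultdict

# Raw events: (user, region, amount)
events = [
    ("u1", "north", 10), ("u1", "north", 5),
    ("u2", "north", 7), ("u3", "south", 20),
]

# First-order aggregation: per-user revenue totals at the entity level
user_totals = defaultdict(float)
user_region = {}
for user, region, amount in events:
    user_totals[user] += amount
    user_region[user] = region

# Second-order aggregation: region totals computed FROM the user-level
# aggregates rather than from the raw events
region_totals = defaultdict(float)
for user, total in user_totals.items():
    region_totals[user_region[user]] += total

print(dict(region_totals))  # {'north': 22.0, 'south': 20.0}
```

Because the second level reads the first level's output, the user-level logic is defined once and reused, rather than duplicated inside the regional rollup.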


Semantic Model

A business-oriented abstraction layer that defines metrics, dimensions, and their relationships. Semantic models enable non-technical users to query data using business terminology without writing SQL.

See also: Metric, Dimension, Semantic Layer Guide


Source

A node representing raw data input from external systems. Sources define the schema, connection parameters, and location of data files or database tables.

YAML kind: source
Source types: csv, parquet, json, postgresql, etc.
See also: Node, Transform


Spark Engine

Apache Spark implementation for distributed data processing. Supports Delta Lake format and is optimized for datasets over 100M rows requiring distributed processing. Requires JVM installation.

Storage format: Delta Lake
Module: src/seeknal/tasks/sparkengine/
Python class: SparkEngineTask
See also: DuckDB, Task, Flow


State

Execution history tracking that stores fingerprints of previously executed nodes. State enables incremental execution by comparing current definitions with previous runs to identify changed nodes.

Storage: SQLite/Turso database
See also: Fingerprint, Cache, Parallel Execution


Task

A unit of data transformation logic within a Flow. Tasks can be DuckDB tasks (DuckDBTask) or Spark tasks (SparkEngineTask) and contain SQL statements or transformation operations.

Python classes: DuckDBTask, SparkEngineTask
Usage: task.add_sql("SELECT * FROM __THIS__")
See also: Flow, DuckDB, Spark Engine


Topological Layer

A group of nodes at the same dependency depth in the DAG. Nodes in the same layer have no dependencies on each other and can be executed in parallel. Layers are ordered from 0 (sources) to N (final outputs).

Example: Layer 0: sources; Layer 1: transforms; Layer 2: feature groups
See also: DAG, Parallel Execution, Manifest
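
Layer assignment follows directly from the definition: a node's layer is 0 if it has no dependencies, otherwise one more than the deepest of its dependencies. A minimal sketch (node names are made up; not Seeknal's implementation):

```python
def layer_of(node, deps, memo=None):
    """Depth of a node in the DAG: 0 for sources, else 1 + max depth
    of its dependencies. `deps` maps node -> list of dependencies."""
    memo = {} if memo is None else memo
    if node not in memo:
        node_deps = deps.get(node, ())
        memo[node] = 0 if not node_deps else 1 + max(
            layer_of(d, deps, memo) for d in node_deps
        )
    return memo[node]

deps = {
    "transform.clean": ["source.customers"],
    "fg.features": ["transform.clean", "source.orders"],
}
for node in ["source.customers", "source.orders", "transform.clean", "fg.features"]:
    print(node, "-> layer", layer_of(node, deps))
# source.customers and source.orders land in layer 0 and can run in parallel
```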


Transform

A SQL transformation node that processes data from sources or other transforms. Transforms define reusable SQL logic and support DuckDB SQL syntax including CTEs, window functions, and joins.

YAML kind: transform
YAML field: transform (contains SQL)
Reference syntax: ref: kind.name (e.g., source.customers)
See also: Source, Feature Group, Node
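
A hypothetical sketch built only from the fields documented in this glossary (kind, the transform SQL field, and the ref: kind.name syntax); the name and the inputs key are illustrative and should be checked against the YAML Schema Reference.

```yaml
# Illustrative sketch -- keys other than kind, transform, and ref are hypothetical
kind: transform
name: clean_customers
inputs:
  - ref: source.customers
transform: |
  SELECT customer_id, lower(email) AS email
  FROM source.customers
  WHERE email IS NOT NULL
```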


Validation

The process of checking data quality, schema correctness, and business rule compliance. Validation occurs during dry-run for YAML syntax and schema, and via rule nodes for data quality constraints.

CLI commands: seeknal dry-run (YAML validation), seeknal validate-features (feature validation)
See also: Dry Run, Rule, Audit


Virtual Environments

Isolated namespaces for deploying different versions of pipelines (dev, staging, prod). Each environment maintains separate state, allowing teams to test changes in isolation before promoting to production.

CLI commands: seeknal env create, seeknal env list, seeknal env promote
See also: Virtual Environments Guide, Environment, Promote


Project

A logical grouping of pipelines, feature groups, and configurations stored in the database. Projects provide isolation between different teams or use cases and are required for most Seeknal operations.

Python class: Project
Python decorator: @require_project
CLI command: seeknal init --name <project_name>
See also: Workspace, Entity


Workspace

A project context that contains configuration, database connection, and environment settings. Every Seeknal operation requires a workspace context, established via decorators like @require_workspace.

Python decorator: @require_workspace
See also: Project, Environment


Last updated: 2026-02-09