
Glossary

Definitions of key terms used throughout Seeknal documentation, organized alphabetically.


Aggregation

A node type that computes aggregate statistics (sum, count, average, etc.) over grouped data. An aggregation is a first-level summarization: it takes a source or transform as input and produces output summarized at the entity level.

YAML kind: aggregation
See also: Second-Order Aggregation, Transform, YAML Schema Reference


Apply

The act of executing a plan in an environment. After reviewing changes with seeknal plan or seeknal dry-run, use seeknal apply to move validated YAML files to production and update the manifest. This is the final step in the draft-dry-run-apply workflow.

CLI commands: seeknal apply <file.yml>, seeknal env apply
See also: Draft, Dry Run, Plan, Virtual Environments


Audit

The process of reviewing and validating pipeline execution history, data quality, and compliance requirements. Seeknal supports querying historical data for regulatory requirements through time travel capabilities in Iceberg materialization.

See also: Iceberg Materialization, Validation


Cache

Temporary storage of execution results to improve performance. Seeknal caches execution state and can reuse intermediate results when nodes haven't changed based on fingerprint comparison.

See also: Fingerprint, State


Change Categorization

A system for classifying changes to pipelines into three categories: BREAKING (incompatible schema changes), NON_BREAKING (compatible additions), and METADATA (non-functional changes). This helps teams understand the impact of changes before applying them.

Categories: BREAKING, NON_BREAKING, METADATA
See also: Change Categorization Guide, Dry Run


DAG

Directed Acyclic Graph - a graph structure where nodes (sources, transforms, feature groups) are connected by directed edges with no cycles. Seeknal uses DAGs to represent pipeline dependencies and determine execution order through topological sorting.

See also: Node, Topological Layer, Manifest
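
The ordering described above can be illustrated with a generic topological sort (Kahn's algorithm). This is a conceptual sketch, not Seeknal's actual implementation; the node names are made up.

```python
from collections import defaultdict, deque

def topological_order(nodes, deps):
    """Return nodes in dependency order using Kahn's algorithm.

    `deps` maps each node to the set of nodes it depends on.
    Raises ValueError if the graph contains a cycle.
    """
    indegree = {n: len(deps.get(n, ())) for n in nodes}
    dependents = defaultdict(list)  # node -> nodes that depend on it
    for node, node_deps in deps.items():
        for dep in node_deps:
            dependents[dep].append(node)

    ready = deque(n for n, d in indegree.items() if d == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for dependent in dependents[node]:
            indegree[dependent] -= 1
            if indegree[dependent] == 0:
                ready.append(dependent)
    if len(order) != len(nodes):
        raise ValueError("cycle detected in pipeline graph")
    return order

# Example: a transform depends on a source, a feature group on the transform
order = topological_order(
    ["fg.features", "source.customers", "transform.clean"],
    {"transform.clean": {"source.customers"},
     "fg.features": {"transform.clean"}},
)
print(order)  # ['source.customers', 'transform.clean', 'fg.features']
```

Because the graph is acyclic, every node appears after all of its dependencies, which is exactly the property an executor needs.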


Data Leakage

A critical error in ML where future information leaks into training data, producing models that score well in evaluation but fail in production. Seeknal prevents data leakage using point-in-time joins that ensure features are computed only from data available at the time of prediction.

See also: Point-in-Time Join, Feature Start Time


Dimension

In the context of semantic models, a dimension is a categorical attribute used for grouping and filtering data (e.g., customer_country, product_category). Dimensions are used to slice metrics in business intelligence queries.

See also: Metric, Semantic Model


Draft

A template-generated YAML file for creating new pipeline nodes. Created using seeknal draft <type> <name>, draft files follow the naming convention draft_<type>_<name>.yml and must be validated with dry-run before applying.

CLI command: seeknal draft <type> <name>
File naming: draft_<type>_<name>.yml
See also: Dry Run, Apply


Dry Run

A validation and preview step that tests YAML files without applying them to production. Executes the pipeline with a limited row count to catch errors early and preview output before applying changes.

CLI command: seeknal dry-run <file.yml>
Flags: --limit <n>, --timeout <s>, --schema-only
See also: Draft, Apply, Validation


DuckDB

A lightweight, in-process SQL database engine used by Seeknal for data processing. DuckDB is the preferred engine for datasets under 100M rows, offering a pure-Python installation with no JVM requirement and fast performance for single-node deployments.

Storage format: Parquet + JSON metadata
Module: src/seeknal/featurestore/duckdbengine/
See also: Task, Spark Engine


Entity

A domain object with a unique identifier, used as the join key for feature groups. Entities represent business concepts like customers, products, or transactions. Each feature group is associated with one entity.

Python class: Entity
YAML fields: entity.name, entity.join_keys
See also: Feature Group, Join Keys


Environment

An isolated namespace for deploying different versions of pipelines (e.g., dev, staging, prod). Virtual environments enable teams to test changes safely before promoting to production.

CLI commands: seeknal env create, seeknal env apply, seeknal env promote
See also: Virtual Environments, Promote


Exposure

A pipeline output that exposes data to downstream consumers such as APIs, dashboards, or file exports. Exposures document how pipeline data is used and help track data lineage.

YAML kind: exposure
See also: Node, DAG


Feature Group

A container for related ML features with shared entity keys, materialization config, and versioning. Feature groups automatically version on schema changes and support both offline (batch) and online (serving) stores.

YAML kind: feature_group
Python classes: FeatureGroup, FeatureGroupDuckDB
CLI commands: seeknal list feature-groups, seeknal delete feature-group <name>
See also: Entity, Materialization, Offline Store, Online Store
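
A hypothetical YAML sketch may help tie the pieces together. Only the fields documented in this glossary (kind, entity.name, entity.join_keys, materialization and its event_time_col) are real; the name and any other keys are illustrative, and the exact schema should be checked against the YAML Schema Reference.

```yaml
# Illustrative sketch -- field names beyond kind, entity.name,
# entity.join_keys, and materialization.event_time_col are hypothetical
kind: feature_group
name: customer_features
entity:
  name: customer
  join_keys: ["customer_id"]
materialization:
  event_time_col: event_time
```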


Feature Start Time

The timestamp from which features should be materialized. Used in feature group materialization to specify the earliest point in time for feature computation, enabling backfilling and incremental updates.

Python parameter: feature_start_time
See also: Materialization, Point-in-Time Join


Fingerprint

A hash computed from a node's definition (SQL, dependencies, parameters) used for change detection. Seeknal compares fingerprints between runs to determine which nodes need re-execution, enabling incremental execution.

See also: State, Cache, Parallel Execution
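
A minimal sketch of the idea, assuming a node definition is a plain dict; Seeknal's actual hashing scheme may differ. Serializing with sorted keys makes the hash stable across dict orderings, so only genuine definition changes alter the fingerprint.

```python
import hashlib
import json

def node_fingerprint(definition: dict) -> str:
    """Hash a node definition for change detection.

    Canonical JSON (sorted keys, fixed separators) ensures the same
    definition always produces the same hash.
    """
    canonical = json.dumps(definition, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

v1 = node_fingerprint({"sql": "SELECT * FROM orders", "inputs": ["source.orders"]})
v2 = node_fingerprint({"inputs": ["source.orders"], "sql": "SELECT * FROM orders"})
v3 = node_fingerprint({"sql": "SELECT id FROM orders", "inputs": ["source.orders"]})
print(v1 == v2)  # True: key order does not matter
print(v1 == v3)  # False: the SQL changed, so the node must re-execute
```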


Flow

A data transformation pipeline that chains multiple tasks together. Flows can mix Spark and DuckDB tasks, executing them sequentially to transform data from input to output.

Python class: Flow
Definition: Flow(name, input, tasks, output)
See also: Task, DuckDB, Spark Engine


Iceberg Materialization

Persisting pipeline outputs to Apache Iceberg table format with ACID transactions, time travel, and schema evolution. Enables atomic commits, rollback capabilities, and querying historical data snapshots.

Configuration file: ~/.seeknal/profiles.yml
CLI commands: seeknal iceberg validate-materialization, seeknal iceberg snapshot-list
See also: Iceberg Materialization Guide, Materialization


Join Keys

The column(s) used to join features to the target dataset in a feature group. Join keys uniquely identify the entity and are specified in the entity definition.

YAML field: entity.join_keys
Examples: ["customer_id"], ["msisdn", "movement_type"]
See also: Entity, Feature Group


Manifest

A JSON file containing the parsed DAG representation of all pipeline nodes with their dependencies, execution order, and metadata. Generated by seeknal plan or seeknal parse and used by seeknal run for execution.

CLI commands: seeknal plan, seeknal parse
See also: DAG, Plan, Topological Layer


Materialization

The process of computing and persisting features to storage. Includes offline materialization (batch processing for training data) and online materialization (low-latency serving for inference). Supports multiple modes including append and overwrite.

Python classes: Materialization, OfflineMaterialization
YAML field: materialization
Modes: append, overwrite
See also: Feature Group, Offline Store, Online Store, Iceberg Materialization


Metric

A quantitative measure computed from data (e.g., total_revenue, average_order_value). In semantic models, metrics are aggregations of measures that can be sliced by dimensions.

See also: Dimension, Semantic Model


Model

A machine learning model definition that consumes features and produces predictions. Models are nodes in the pipeline DAG and can depend on feature groups and transforms.

YAML kind: model
See also: Feature Group, Node


Node

A single unit in the pipeline DAG representing a source, transform, feature group, aggregation, model, rule, or exposure. Each node has a unique name, type (kind), dependencies (inputs), and execution logic.

Supported kinds: source, transform, feature_group, aggregation, second_order_aggregation, model, rule, exposure
See also: DAG, Topological Layer


Offline Store

Storage backend for batch-computed features used to build training data. Supports file-based storage (Parquet, CSV) and cloud storage (S3, GCS). Optimized for large-scale batch processing, with higher latency than online stores.

Python classes: OfflineStore, OfflineStoreEnum
Storage formats: FILE (Parquet), cloud storage (S3, GCS)
See also: Materialization, Online Store, Feature Group


Online Store

Storage backend for low-latency feature serving used during inference. Optimized for real-time feature retrieval, with TTL (time-to-live) support for feature expiration.

Python classes: OnlineStore, OnlineStoreEnum
Features: low-latency retrieval, TTL support
See also: Materialization, Offline Store, Feature Group


Parallel Execution

The ability to execute independent nodes concurrently within the same topological layer. Seeknal identifies nodes without dependencies and runs them in parallel to reduce total execution time.

See also: Topological Layer, DAG, State
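
A conceptual sketch of layer-by-layer parallel execution, assuming layers have already been computed from the DAG; the executor callable and node names are placeholders, not Seeknal's API.

```python
from concurrent.futures import ThreadPoolExecutor

def run_layers(layers, execute):
    """Run each topological layer in order; nodes within a layer run concurrently.

    `layers` is a list of lists of node names; `execute` runs one node.
    """
    results = {}
    for layer in layers:
        with ThreadPoolExecutor(max_workers=len(layer) or 1) as pool:
            futures = {node: pool.submit(execute, node) for node in layer}
            for node, future in futures.items():
                results[node] = future.result()  # re-raises any node failure
    return results

# Example with a trivial executor: both sources run concurrently,
# then the transform runs once they are done
results = run_layers(
    [["source.a", "source.b"], ["transform.join"]],
    execute=lambda node: f"ran {node}",
)
print(results["transform.join"])  # ran transform.join
```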


Plan

A preview of changes and execution order before running the pipeline. The plan command generates a manifest showing the DAG structure, topological layers, and identifies which nodes will execute based on state comparison.

CLI commands: seeknal plan, seeknal parse
See also: Manifest, Apply, Dry Run


Point-in-Time Join

A temporal join that ensures features are computed only from data available at the prediction time, preventing data leakage. Seeknal uses the event_time_col to perform point-in-time correct joins when materializing historical features.

Configuration: materialization.event_time_col
See also: Point-in-Time Joins Guide, Data Leakage, Materialization
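
The idea can be sketched in a few lines of plain Python: for each label row, pick the latest feature value whose event time is at or before the prediction time. This illustrates the concept only; it is not Seeknal's implementation.

```python
from bisect import bisect_right

def point_in_time_join(labels, features):
    """Attach to each (entity, prediction_time) label the latest feature
    value observed at or before prediction_time.

    `features` maps entity -> list of (event_time, value), sorted by time.
    Events strictly after prediction_time are excluded, preventing leakage.
    """
    joined = []
    for entity, pred_time in labels:
        history = features.get(entity, [])
        times = [t for t, _ in history]
        i = bisect_right(times, pred_time)  # index of first event AFTER pred_time
        value = history[i - 1][1] if i > 0 else None
        joined.append((entity, pred_time, value))
    return joined

features = {"c1": [(1, "bronze"), (5, "gold")]}
labels = [("c1", 3), ("c1", 7), ("c2", 3)]
print(point_in_time_join(labels, features))
# [('c1', 3, 'bronze'), ('c1', 7, 'gold'), ('c2', 3, None)]
```

Note that the label at time 3 sees "bronze", not "gold": the upgrade at time 5 had not happened yet, so using it would be leakage.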


Promote

The act of moving a validated pipeline from one environment to another (e.g., dev to staging to prod). Promotion copies environment-specific configurations and validates compatibility before deployment.

CLI command: seeknal env promote --from dev --to prod
See also: Environment, Virtual Environments, Apply


Python Pipeline

A data transformation pipeline defined in Python using the Flow API or as standalone Python scripts with inline transformations. Python pipelines support custom business logic, ML model training, and external API integration that cannot be expressed in SQL alone.

File location: seeknal/pipelines/*.py
See also: Python Pipelines Guide, Python API vs YAML Workflows, Flow


Rule

A validation or business rule node that checks data quality constraints. Rules can enforce data integrity, completeness, and business logic requirements within the pipeline.

YAML kind: rule
See also: Validation, Node


Second-Order Aggregation

An aggregation that computes summary statistics over first-order aggregations. For example, computing regional totals from user-level metrics. Second-order aggregations support hierarchical rollups and multi-level analytics, enabling you to build features at different levels of granularity without duplicating aggregation logic.

YAML kind: second_order_aggregation
Example use cases: region-level totals from user-level metrics, department rollups from team metrics
See also: Second-Order Aggregations Guide, Aggregation, YAML Pipeline Tutorial
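
The region-totals example above can be sketched in plain Python to show the two levels; the data and column names are made up, and this is conceptual only, not Seeknal's aggregation engine.

```python
from collections import defaultdict

# Raw events: (user, region, amount)
events = [
    ("u1", "north", 10), ("u1", "north", 5),
    ("u2", "north", 7), ("u3", "south", 20),
]

# First-order aggregation: per-user revenue totals at the entity level
user_totals = defaultdict(float)
user_region = {}
for user, region, amount in events:
    user_totals[user] += amount
    user_region[user] = region

# Second-order aggregation: region totals computed FROM the user-level
# aggregates rather than from the raw events
region_totals = defaultdict(float)
for user, total in user_totals.items():
    region_totals[user_region[user]] += total

print(dict(region_totals))  # {'north': 22.0, 'south': 20.0}
```

Because the second level reads the first level's output, the user-level logic is defined once and reused, rather than duplicated inside the regional rollup.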


Semantic Model

A business-oriented abstraction layer that defines metrics, dimensions, and their relationships. Semantic models enable non-technical users to query data using business terminology without writing SQL.

See also: Metric, Dimension, Semantic Layer Guide


Source

A node representing raw data input from external systems. Sources define the schema, connection parameters, and location of data files or database tables.

YAML kind: source
Source types: csv, parquet, json, postgresql, etc.
See also: Node, Transform


Spark Engine

Apache Spark implementation for distributed data processing. Supports Delta Lake format and is optimized for datasets over 100M rows requiring distributed processing. Requires JVM installation.

Storage format: Delta Lake
Module: src/seeknal/tasks/sparkengine/
Python class: SparkEngineTask
See also: DuckDB, Task, Flow


State

Execution history tracking that stores fingerprints of previously executed nodes. State enables incremental execution by comparing current definitions with previous runs to identify changed nodes.

Storage: SQLite/Turso database
See also: Fingerprint, Cache, Parallel Execution


Task

A unit of data transformation logic within a Flow. Tasks can be DuckDB tasks (DuckDBTask) or Spark tasks (SparkEngineTask) and contain SQL statements or transformation operations.

Python classes: DuckDBTask, SparkEngineTask
Usage: task.add_sql("SELECT * FROM __THIS__")
See also: Flow, DuckDB, Spark Engine


Topological Layer

A group of nodes at the same dependency depth in the DAG. Nodes in the same layer have no dependencies on each other and can be executed in parallel. Layers are ordered from 0 (sources) to N (final outputs).

Example: Layer 0: sources; Layer 1: transforms; Layer 2: feature groups
See also: DAG, Parallel Execution, Manifest
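
Layer assignment follows directly from the definition: a node's layer is 0 if it has no dependencies, otherwise one more than the deepest of its dependencies. A minimal sketch (node names are made up; not Seeknal's implementation):

```python
def layer_of(node, deps, memo=None):
    """Depth of a node in the DAG: 0 for sources, else 1 + max depth
    of its dependencies. `deps` maps node -> list of dependencies."""
    memo = {} if memo is None else memo
    if node not in memo:
        node_deps = deps.get(node, ())
        memo[node] = 0 if not node_deps else 1 + max(
            layer_of(d, deps, memo) for d in node_deps
        )
    return memo[node]

deps = {
    "transform.clean": ["source.customers"],
    "fg.features": ["transform.clean", "source.orders"],
}
for node in ["source.customers", "source.orders", "transform.clean", "fg.features"]:
    print(node, "-> layer", layer_of(node, deps))
# source.customers and source.orders land in layer 0 and can run in parallel
```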


Transform

A SQL transformation node that processes data from sources or other transforms. Transforms define reusable SQL logic and support DuckDB SQL syntax including CTEs, window functions, and joins.

YAML kind: transform
YAML field: transform (contains SQL)
Reference syntax: ref: kind.name (e.g., source.customers)
See also: Source, Feature Group, Node
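
A hypothetical sketch built only from the fields documented in this glossary (kind, the transform SQL field, and the ref: kind.name syntax); the name and the inputs key are illustrative and should be checked against the YAML Schema Reference.

```yaml
# Illustrative sketch -- keys other than kind, transform, and ref are hypothetical
kind: transform
name: clean_customers
inputs:
  - ref: source.customers
transform: |
  SELECT customer_id, lower(email) AS email
  FROM source.customers
  WHERE email IS NOT NULL
```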


Validation

The process of checking data quality, schema correctness, and business rule compliance. Validation occurs during dry-run for YAML syntax and schema, and via rule nodes for data quality constraints.

CLI commands: seeknal dry-run (YAML validation), seeknal validate-features (feature validation)
See also: Dry Run, Rule, Audit


Virtual Environments

Isolated namespaces for deploying different versions of pipelines (dev, staging, prod). Each environment maintains separate state, allowing teams to test changes in isolation before promoting to production.

CLI commands: seeknal env create, seeknal env list, seeknal env promote
See also: Virtual Environments Guide, Environment, Promote


Project

A logical grouping of pipelines, feature groups, and configurations stored in the database. Projects provide isolation between different teams or use cases and are required for most Seeknal operations.

Python class: Project
Python decorator: @require_project
CLI command: seeknal init --name <project_name>
See also: Workspace, Entity


Workspace

A project context that contains configuration, database connection, and environment settings. Every Seeknal operation requires a workspace context, established via decorators like @require_workspace.

Python decorator: @require_workspace
See also: Project, Environment


Last updated: 2026-02-09