Feature Groups

Feature groups organize features for machine learning with point-in-time correctness.


Overview

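Feature groups are collections of features for a specific entity (e.g., customer, product). They enable feature reuse across models and, through point-in-time joins, prevent data leakage: each training row sees only the feature values that existed at its prediction time.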


Feature Group Concepts

Entities

An entity defines the join keys that identify each row of features:

from seeknal.entity import Entity

customer_entity = Entity(
    name="customer",
    join_keys=["customer_id"]
)

Materialization

Control when and how features are computed:

from seeknal.featurestore.duckdbengine.feature_group import Materialization

materialization = Materialization(
    event_time_col="transaction_time",
    lookback_days=30
)
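The exact semantics of lookback_days are not spelled out here. As a minimal sketch, assuming it bounds how far back from a reference time events are considered, the window could be expressed in plain Python as follows. The helper within_lookback and the sample rows are hypothetical and not part of seeknal.

```python
from datetime import datetime, timedelta

def within_lookback(rows, as_of, lookback_days=30, event_time_col="transaction_time"):
    """Keep rows whose event time falls in [as_of - lookback_days, as_of].

    Assumption: this mirrors what a lookback window means conceptually;
    it is not seeknal's implementation.
    """
    start = as_of - timedelta(days=lookback_days)
    return [r for r in rows if start <= r[event_time_col] <= as_of]

# Hypothetical sample events for one customer
rows = [
    {"customer_id": "c1", "transaction_time": datetime(2024, 1, 20)},
    {"customer_id": "c1", "transaction_time": datetime(2023, 11, 1)},
]

# Only the January event lies within 30 days of 2024-02-01
recent = within_lookback(rows, as_of=datetime(2024, 2, 1))
```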

Creating Feature Groups

DuckDB Feature Groups

from datetime import datetime

from seeknal.featurestore.duckdbengine.feature_group import FeatureGroupDuckDB

fg = FeatureGroupDuckDB(
    name="customer_features",
    entity=customer_entity,
    materialization=materialization,
    project="my_project"
)

# Auto-detect features from a DataFrame (df is assumed to be loaded beforehand)
fg.set_dataframe(df).set_features()

# Write to offline store
fg.write(feature_start_time=datetime(2024, 1, 1))

Spark Feature Groups

from seeknal.featurestore.feature_group import FeatureGroup

fg = FeatureGroup(
    name="customer_features",
    entity=customer_entity,
    project="my_project"
)

# Set features explicitly
fg.set_features({
    "total_purchases": "sum(purchase_amount)",
    "transaction_count": "count(*)"
})
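To make the two expressions above concrete, here is a plain-Python sketch of what sum(purchase_amount) and count(*) compute per customer. It assumes the aggregation is grouped by the entity's join key (customer_id). The function aggregate_by_customer and the sample transactions are hypothetical, and this is not how seeknal or Spark evaluates the expressions.

```python
from collections import defaultdict

# Hypothetical raw transactions
transactions = [
    {"customer_id": "c1", "purchase_amount": 20.0},
    {"customer_id": "c1", "purchase_amount": 5.5},
    {"customer_id": "c2", "purchase_amount": 12.0},
]

def aggregate_by_customer(rows):
    """Group rows by customer_id and compute the two example features."""
    totals = defaultdict(lambda: {"total_purchases": 0.0, "transaction_count": 0})
    for row in rows:
        agg = totals[row["customer_id"]]
        agg["total_purchases"] += row["purchase_amount"]  # sum(purchase_amount)
        agg["transaction_count"] += 1                     # count(*)
    return dict(totals)

customer_aggregates = aggregate_by_customer(transactions)
```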

Point-in-Time Joins

Prevent data leakage by joining features as they appeared at prediction time. Use FeatureFrame.pit_join() in transforms:

from seeknal.pipeline import transform

@transform(name="training_data")
def training_data(ctx):
    # Spine with entity keys and prediction dates
    labels = ctx.ref("source.churn_labels")

    # PIT join: get features as of each application_date
    features_df = ctx.ref("feature_group.customer_features").pit_join(
        spine=labels,
        date_col="application_date",
        keep_cols=["label"],
    )
    return features_df
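The idea behind a point-in-time join can be illustrated without seeknal. The sketch below uses only the standard library: for each spine row it picks the most recent feature row whose event time is on or before the spine date. The function pit_join, the event_time column name, and the inclusive boundary are assumptions made for illustration. This is not FeatureFrame.pit_join() itself.

```python
from bisect import bisect_right
from datetime import date

def pit_join(spine, feature_rows, key="customer_id",
             date_col="application_date", event_col="event_time"):
    """Attach to each spine row the latest features known at its date."""
    # Group feature rows per entity, ordered by event time
    by_key = {}
    for row in sorted(feature_rows, key=lambda r: r[event_col]):
        by_key.setdefault(row[key], []).append(row)

    joined = []
    for s in spine:
        history = by_key.get(s[key], [])
        times = [r[event_col] for r in history]
        # Count of feature rows with event time <= spine date
        idx = bisect_right(times, s[date_col])
        match = history[idx - 1] if idx else None
        features = (
            {k: v for k, v in match.items() if k not in (key, event_col)}
            if match else {}
        )
        joined.append({**s, **features})
    return joined

# Hypothetical feature history and spine
features = [
    {"customer_id": 1, "event_time": date(2024, 1, 1), "total_purchases": 100.0},
    {"customer_id": 1, "event_time": date(2024, 2, 1), "total_purchases": 250.0},
]
spine = [{"customer_id": 1, "application_date": date(2024, 1, 15), "label": 0}]

result = pit_join(spine, features)
```

The February value is ignored for the mid-January spine row because it did not exist yet at that date, which is exactly the leakage a naive latest-value join would introduce.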

Online Serving

Retrieve features for low-latency serving using ctx.features():

from seeknal.pipeline import transform

@transform(name="predictions")
def predictions(ctx):
    # Get latest features from entity store
    features = ctx.features("customer", [
        "customer_features.total_purchases",
        "customer_features.avg_order_value",
    ])
    return features
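Conceptually, online serving returns the newest stored value of each requested feature for one entity. The following sketch shows that lookup over an in-memory list. The helper latest_features, the event_time column, and the sample store are hypothetical. How ctx.features() resolves values internally is not described here.

```python
from datetime import datetime

def latest_features(rows, entity_id, names, key="customer_id", event_col="event_time"):
    """Return the requested features from the newest row for one entity, or None."""
    candidates = [r for r in rows if r[key] == entity_id]
    if not candidates:
        return None
    newest = max(candidates, key=lambda r: r[event_col])
    return {name: newest.get(name) for name in names}

# Hypothetical stored feature rows
store = [
    {"customer_id": "c1", "event_time": datetime(2024, 1, 1),
     "total_purchases": 100.0, "avg_order_value": 25.0},
    {"customer_id": "c1", "event_time": datetime(2024, 2, 1),
     "total_purchases": 250.0, "avg_order_value": 31.25},
]

online_row = latest_features(store, "c1", ["total_purchases", "avg_order_value"])
```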

Feature Group Configuration

Common Options

Option           Type             Description             Default
name             string           Unique identifier       Required
entity           Entity           Entity definition       Required
materialization  Materialization  Materialization config  Optional
project          string           Project name            Required

Best Practices

  1. Always use point-in-time joins when building training data from features
  2. Define clear entities with join keys
  3. Version feature groups for production
  4. Document feature logic clearly
  5. Test feature serving before deployment


Next: Learn about Semantic Models or return to Building Blocks