Feature Groups
Feature groups organize features for machine learning with point-in-time correctness.
Overview
Feature groups are collections of features for a specific entity (e.g., customer, product). They enable feature reuse and prevent data leakage through point-in-time joins.
Feature Group Concepts
Entities
Entities define the join key for features:
from seeknal.entity import Entity

customer_entity = Entity(
    name="customer",
    join_keys=["customer_id"]
)
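To make the role of a join key concrete, here is a minimal plain-Python sketch. `EntitySketch` and `key_of` are hypothetical names used only for illustration; they are not seeknal's `Entity` class or its API. The sketch assumes rows are dictionaries and shows how a join key lets feature rows be matched to label rows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntitySketch:
    """Hypothetical stand-in for an entity; not seeknal's Entity class."""
    name: str
    join_keys: tuple

    def key_of(self, row: dict) -> tuple:
        # Build the (possibly composite) key used to match rows across tables.
        return tuple(row[k] for k in self.join_keys)


customer = EntitySketch(name="customer", join_keys=("customer_id",))

features = [
    {"customer_id": 1, "total_purchases": 120.0},
    {"customer_id": 2, "total_purchases": 35.5},
]
labels = [
    {"customer_id": 2, "label": 1},
    {"customer_id": 1, "label": 0},
]

# Index feature rows by entity key, then attach them to each label row.
by_key = {customer.key_of(row): row for row in features}
joined = [{**by_key[customer.key_of(row)], **row} for row in labels]
```

Every table that shares the same join key can be combined this way, which is what makes features reusable across models.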
Materialization
Control when and how features are computed:
from seeknal.featurestore.duckdbengine.feature_group import Materialization

materialization = Materialization(
    event_time_col="transaction_time",
    lookback_days=30
)
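The text above does not spell out how `lookback_days` is applied. One plausible reading, shown here purely as an assumption, is a trailing window: only rows whose event time falls within the last N days before a reference time are used. The helper `within_lookback` is hypothetical and is not part of seeknal.

```python
from datetime import datetime, timedelta


def within_lookback(rows, event_time_col, as_of, lookback_days):
    """Keep rows whose event time lies in (as_of - lookback_days, as_of].

    Assumed semantics for illustration only; check the library's
    documentation for the actual behaviour of lookback_days.
    """
    start = as_of - timedelta(days=lookback_days)
    return [row for row in rows if start < row[event_time_col] <= as_of]


transactions = [
    {"customer_id": 1, "transaction_time": datetime(2024, 1, 5), "purchase_amount": 20.0},
    {"customer_id": 1, "transaction_time": datetime(2024, 2, 20), "purchase_amount": 15.0},
    {"customer_id": 1, "transaction_time": datetime(2024, 3, 1), "purchase_amount": 40.0},
]

# With a 30-day window ending on 2024-03-01, the January row is dropped.
recent = within_lookback(
    transactions, "transaction_time", datetime(2024, 3, 1), lookback_days=30
)
```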
Creating Feature Groups
DuckDB Feature Groups
from datetime import datetime

from seeknal.featurestore.duckdbengine.feature_group import FeatureGroupDuckDB

fg = FeatureGroupDuckDB(
    name="customer_features",
    entity=customer_entity,
    materialization=materialization,
    project="my_project"
)
# Auto-detect features from DataFrame (df is a DataFrame prepared earlier)
fg.set_dataframe(df).set_features()
# Write to offline store
fg.write(feature_start_time=datetime(2024, 1, 1))
Spark Feature Groups
from seeknal.featurestore.feature_group import FeatureGroup

fg = FeatureGroup(
    name="customer_features",
    entity=customer_entity,
    project="my_project"
)
# Set features explicitly
fg.set_features({
    "total_purchases": "sum(purchase_amount)",
    "transaction_count": "count(*)"
})
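The expressions `sum(purchase_amount)` and `count(*)` describe per-entity aggregates. As a rough sketch of what they produce, the plain-Python function below groups rows by `customer_id` and computes both values. `aggregate_per_entity` and the sample rows are hypothetical; the actual computation is performed by the engine, not by code like this.

```python
from collections import defaultdict


def aggregate_per_entity(rows, key="customer_id"):
    """Compute sum(purchase_amount) and count(*) for each entity key."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[key]] += row["purchase_amount"]
        counts[row[key]] += 1
    return {
        k: {"total_purchases": totals[k], "transaction_count": counts[k]}
        for k in totals
    }


purchases = [
    {"customer_id": 1, "purchase_amount": 10.0},
    {"customer_id": 1, "purchase_amount": 5.5},
    {"customer_id": 2, "purchase_amount": 7.0},
]

customer_features = aggregate_per_entity(purchases)
```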
Point-in-Time Joins
Prevent data leakage by joining features as they appeared at prediction time. Use FeatureFrame.pit_join() in transforms:
from seeknal.pipeline import transform

@transform(name="training_data")
def training_data(ctx):
    # Spine with entity keys and prediction dates
    labels = ctx.ref("source.churn_labels")
    # PIT join: get features as of each application_date
    features_df = ctx.ref("feature_group.customer_features").pit_join(
        spine=labels,
        date_col="application_date",
        keep_cols=["label"],
    )
    return features_df
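The idea behind a point-in-time join can be sketched without any library: for each spine row, take the most recent feature row for the same entity whose event time is not later than the spine date. The function `pit_join_sketch`, the `event_time` column name, and the sample data below are assumptions for illustration; this is not how `pit_join()` is implemented.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import date


def pit_join_sketch(spine, feature_rows, key, date_col, event_time_col):
    """For each spine row, attach the latest feature row at or before its date."""
    # Group feature rows per entity, ordered by event time.
    history = defaultdict(list)
    for row in sorted(feature_rows, key=lambda r: r[event_time_col]):
        history[row[key]].append(row)

    joined = []
    for s in spine:
        rows = history.get(s[key], [])
        times = [r[event_time_col] for r in rows]
        # Number of feature rows with event time <= the spine date.
        idx = bisect_right(times, s[date_col])
        match = rows[idx - 1] if idx else {}
        feats = {k: v for k, v in match.items() if k not in (key, event_time_col)}
        # Spine rows with no earlier feature row get no feature columns.
        joined.append({**s, **feats})
    return joined


feature_rows = [
    {"customer_id": 1, "event_time": date(2024, 1, 1), "total_purchases": 100.0},
    {"customer_id": 1, "event_time": date(2024, 2, 1), "total_purchases": 250.0},
]
spine = [
    {"customer_id": 1, "application_date": date(2023, 12, 1), "label": 0},
    {"customer_id": 1, "application_date": date(2024, 1, 15), "label": 0},
    {"customer_id": 1, "application_date": date(2024, 2, 10), "label": 1},
]

training_rows = pit_join_sketch(
    spine, feature_rows, "customer_id", "application_date", "event_time"
)
```

The January spine row sees only the value recorded on 2024-01-01, even though a later value exists. A plain join on `customer_id` would expose that later value and leak future information into training.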
Online Serving
Retrieve features for low-latency serving using ctx.features():
from seeknal.pipeline import transform

@transform(name="predictions")
def predictions(ctx):
    # Get latest features from entity store
    features = ctx.features("customer", [
        "customer_features.total_purchases",
        "customer_features.avg_order_value",
    ])
    return features
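Online serving differs from a point-in-time join in that it only needs the latest value per entity. A minimal in-memory sketch of that idea follows. `LatestFeatureStoreSketch` is a hypothetical class, not seeknal's entity store, and it assumes each write carries an event time so that late-arriving older rows do not overwrite newer ones.

```python
from datetime import datetime


class LatestFeatureStoreSketch:
    """Hypothetical key-value store that keeps one latest feature row per entity."""

    def __init__(self):
        self._latest = {}  # entity_id -> (event_time, features)

    def write(self, entity_id, event_time, features):
        current = self._latest.get(entity_id)
        # Only replace the stored row when the incoming one is at least as recent.
        if current is None or event_time >= current[0]:
            self._latest[entity_id] = (event_time, dict(features))

    def get(self, entity_id, names):
        _, feats = self._latest.get(entity_id, (None, {}))
        # Unknown entities or features come back as None.
        return {name: feats.get(name) for name in names}


store = LatestFeatureStoreSketch()
store.write(1, datetime(2024, 2, 1), {"total_purchases": 250.0, "avg_order_value": 50.0})
store.write(1, datetime(2024, 1, 1), {"total_purchases": 100.0, "avg_order_value": 25.0})

served = store.get(1, ["total_purchases", "avg_order_value"])
missing = store.get(99, ["total_purchases"])
```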
Feature Group Configuration
Common Options
| Option | Type | Description | Default |
|---|---|---|---|
| name | string | Unique identifier | Required |
| entity | Entity | Entity definition | Required |
| materialization | Materialization | Materialization config | Optional |
| project | string | Project name | Required |
Best Practices
- Always use point-in-time joins for ML features
- Define clear entities with join keys
- Version feature groups for production
- Document feature logic clearly
- Test feature serving before deployment
Related Topics
- Point-in-Time Joins - Preventing data leakage
- Training to Serving - End-to-end ML workflow
- ML Engineer Path - Complete ML tutorial
Next: Learn about Semantic Models or return to Building Blocks