import pandas as pd

orders = pd.read_parquet('orders.parquet')
orders['ordered_at'] = pd.to_datetime(orders['ordered_at'])

# Score customers as of the day after the last observed order, so
# recency is always at least one day.
reference_date = orders['ordered_at'].max() + pd.Timedelta(days=1)

features = (
    orders.groupby('customer_id')
    .agg(
        recency_days=('ordered_at', lambda values: (reference_date - values.max()).days),
        frequency=('order_id', 'nunique'),
        monetary=('amount', 'sum'),
        avg_basket=('amount', 'mean'),
    )
    .reset_index()
)

# Defensive guard against division by zero, although every grouped
# customer has at least one order here.
features['monetary_per_order'] = features['monetary'] / features['frequency'].clip(lower=1)
print(features.head())
Tabular models improve fast when you encode behavior rather than raw events. Recency, frequency, and monetary (RFM) aggregates are durable baseline features for retention, fraud, and conversion use cases. I usually build them in pure pandas first, then port them to a scheduled feature pipeline once the signal is proven.
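The porting step is mostly a matter of parameterizing what the exploratory script hard-codes. A minimal sketch of that shape, where `build_rfm_features` is my own naming and the synthetic frame stands in for `orders.parquet`, might look like:

```python
import numpy as np
import pandas as pd

def build_rfm_features(orders: pd.DataFrame, reference_date: pd.Timestamp) -> pd.DataFrame:
    """Pure-pandas RFM aggregation, with the cutoff date passed in so a
    scheduler can supply each run's as-of date instead of deriving it
    from the data it happens to see."""
    return (
        orders.groupby('customer_id')
        .agg(
            recency_days=('ordered_at', lambda v: (reference_date - v.max()).days),
            frequency=('order_id', 'nunique'),
            monetary=('amount', 'sum'),
            avg_basket=('amount', 'mean'),
        )
        .reset_index()
    )

# Synthetic stand-in for orders.parquet.
rng = np.random.default_rng(0)
n = 200
orders = pd.DataFrame({
    'customer_id': rng.integers(1, 40, n),
    'order_id': np.arange(n),
    'amount': rng.gamma(2.0, 25.0, n).round(2),
    'ordered_at': pd.Timestamp('2024-01-01')
        + pd.to_timedelta(rng.integers(0, 90, n), unit='D'),
})

features = build_rfm_features(orders, orders['ordered_at'].max() + pd.Timedelta(days=1))
print(features.head())
```

Keeping the function free of I/O also makes it trivial to unit-test against a handful of hand-built rows before any scheduler is involved.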