import pandas as pd
customers = pd.read_parquet('customers.parquet')
orders = pd.read_parquet('orders.parquet')
assert customers['customer_id'].is_unique, 'customer table must be unique by customer_id'
# Left-join the customer dimension onto orders; validate the expected
# cardinality and keep the merge indicator for auditing row provenance.
enriched = orders.merge(
    customers[['customer_id', 'plan_tier', 'country']],
    on='customer_id',
    how='left',
    validate='many_to_one',  # raises MergeError if customers duplicates a key
    indicator=True,          # adds a '_merge' column: 'both' / 'left_only' / 'right_only'
)
unmatched = enriched.loc[enriched['_merge'] != 'both', 'customer_id'].unique()
if len(unmatched) > 0:
    raise ValueError(f'missing customer dimension rows for {len(unmatched)} customer_ids')
enriched = enriched.drop(columns=['_merge'])
print(enriched.head())
Merges are where silent data corruption often begins. I prefer explicit key audits, join cardinality validation, and indicator columns when investigating row loss or duplication. In production analytics, proving that a join is one_to_one or many_to_one is more valuable than keeping the code short.