import pandas as pd
customers = pd.read_parquet('customers.parquet')
orders = pd.read_parquet('orders.parquet')
assert customers['customer_id'].is_unique, 'customer table must be unique by customer_id'
# Left-join the customer dimension onto orders; validate the expected
# cardinality and keep the merge indicator for auditing row provenance.
enriched = orders.merge(
    customers[['customer_id', 'plan_tier', 'country']],
    on='customer_id',
    how='left',
    validate='many_to_one',  # raises MergeError if customers duplicates a key
    indicator=True,          # adds a '_merge' column: 'both' / 'left_only' / 'right_only'
)
unmatched = enriched.loc[enriched['_merge'] != 'both', 'customer_id'].unique()
if len(unmatched) > 0:
    raise ValueError(f'missing customer dimension rows for {len(unmatched)} customer_ids')
enriched = enriched.drop(columns=['_merge'])
print(enriched.head())
Merges are where silent data corruption often begins. I prefer explicit key audits, join cardinality validation, and indicator columns when investigating row loss or duplication. In production analytics, proving that a join is one_to_one or many_to_one is more valuable than keeping the code short.