import pandas as pd

df = pd.read_csv('customers.csv')

# Canonicalize column names: strip whitespace, lowercase, snake_case.
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_', regex=False)

# Canonical text cleanup on key fields.
df['email'] = df['email'].str.strip().str.lower()
df['country'] = df['country'].fillna('unknown').str.strip().str.upper()

# Explicit type conversion; malformed values become NaT/NaN instead of raising.
df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')
df['lifetime_value'] = pd.to_numeric(df['lifetime_value'], errors='coerce')

# Drop rows with no email first, then keep one record per email.
# Sorting by signup_date makes keep='last' deterministic: the most
# recent signup wins rather than whichever row the file happened to list last.
df = df[df['email'].notna()]
df = df.sort_values('signup_date').drop_duplicates(subset=['email'], keep='last')

# Impute missing lifetime values with the median, which is robust to outliers.
df['lifetime_value'] = df['lifetime_value'].fillna(df['lifetime_value'].median())

# Per-column missing-value ratios, worst first.
missing_summary = (
    df.isna()
    .mean()
    .sort_values(ascending=False)
    .rename('missing_ratio')
)
assert missing_summary.loc['email'] == 0, 'email should be fully populated'
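The `errors='coerce'` conversions above fail silently, so it is worth seeing what they actually do. A minimal sketch with synthetic rows (hypothetical values, not from `customers.csv`) showing how malformed dates and numbers turn into NaT/NaN, and how counting them gives a cheap drift alarm:

```python
import pandas as pd

# Hypothetical malformed rows to demonstrate errors='coerce'.
raw = pd.DataFrame({
    'signup_date': ['2023-01-15', 'not a date', '2023-06-30'],
    'lifetime_value': ['19.99', 'N/A', '250'],
})

dates = pd.to_datetime(raw['signup_date'], errors='coerce')
values = pd.to_numeric(raw['lifetime_value'], errors='coerce')

# Coercion failures become NaT/NaN; count them before imputing anything.
bad_dates = int(dates.isna().sum())
bad_values = int(values.isna().sum())
print(bad_dates, bad_values)  # 1 1
```

Logging these counts per ingest run makes a sudden spike in unparseable values visible immediately, rather than being averaged away by the median imputation.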
Real data arrives dirty. I usually start with missing-value audits, duplicate removal, explicit type conversion, and canonical text cleanup. The trick is to make each cleanup rule reproducible rather than burying it in notebook state. I prefer small, composable transformations and assertions that fail loudly when source feeds drift.
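The "small, composable transformations" idea can be sketched with `DataFrame.pipe`, which chains named cleanup steps instead of accumulating notebook state. The helper names here (`normalize_emails`, `require_unique`) are hypothetical, not part of the script above:

```python
import pandas as pd

def normalize_emails(df: pd.DataFrame) -> pd.DataFrame:
    # Pure function: returns a cleaned copy rather than mutating in place.
    out = df.copy()
    out['email'] = out['email'].str.strip().str.lower()
    return out

def require_unique(df: pd.DataFrame, col: str) -> pd.DataFrame:
    # Fail loudly the moment a source feed starts sending duplicates.
    dupes = int(df[col].duplicated().sum())
    assert dupes == 0, f'{dupes} duplicate values in {col!r}'
    return df

raw = pd.DataFrame({'email': [' A@x.com ', 'b@x.com']})
clean = raw.pipe(normalize_emails).pipe(require_unique, 'email')
print(clean['email'].tolist())  # ['a@x.com', 'b@x.com']
```

Because each step is a named function, the pipeline reads as documentation of the cleanup rules, and any step can be unit-tested against a hand-built DataFrame.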