Encoding categorical variables without creating leakage

Categoricals are where good intentions become leakage. I use one-hot encoding for low-cardinality, stable fields; ordinal encoders only when the order is real; and frequency or target encoders with strict cross-validation boundaries. The encoder strategy should be chosen per field, not applied as a single global default.
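A minimal sketch of the cross-validation boundary idea, using out-of-fold target encoding: each row is encoded using statistics computed only on the other folds, so its own label never leaks into its feature. Column names and data here are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(df, col, target, n_splits=5, seed=0):
    """Out-of-fold mean target encoding: each row's encoding comes
    only from the other folds, so its own label cannot leak."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    global_mean = df[target].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, apply_idx in kf.split(df):
        means = df.iloc[fit_idx].groupby(col)[target].mean()
        encoded.iloc[apply_idx] = df[col].iloc[apply_idx].map(means).to_numpy()
    return encoded.fillna(global_mean)  # unseen levels fall back to the prior

# Hypothetical churn data with one categorical field.
df = pd.DataFrame({
    "plan": ["free", "free", "pro", "pro", "free", "pro"],
    "churned": [1, 0, 0, 0, 1, 1],
})
df["plan_te"] = oof_target_encode(df, "plan", "churned", n_splits=3)
```

The same fold boundaries should be reused by the downstream model evaluation, otherwise the encoding still peeks across validation splits.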

Feature engineering for recency, frequency, and monetary behavior

Tabular models improve fast when you encode behavior rather than raw events. Recency, frequency, and monetary aggregates are durable baseline features for retention, fraud, and conversion use cases. I usually build them in pure pandas first, then port the logic to the production pipeline once the definitions have settled.
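A pure-pandas sketch of the RFM aggregates, assuming a hypothetical event log with one row per transaction and an explicit snapshot date for recency:

```python
import pandas as pd

# Hypothetical event log: one row per customer transaction.
events = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "ts": pd.to_datetime(["2024-01-02", "2024-01-20", "2024-01-05",
                          "2024-01-06", "2024-01-25", "2024-01-10"]),
    "amount": [20.0, 35.0, 10.0, 15.0, 40.0, 99.0],
})
as_of = pd.Timestamp("2024-02-01")  # snapshot date for recency

rfm = (
    events.groupby("customer_id")
    .agg(last_ts=("ts", "max"),          # most recent activity
         frequency=("ts", "size"),       # event count
         monetary=("amount", "sum"))     # total spend
    .assign(recency_days=lambda d: (as_of - d["last_ts"]).dt.days)
    .drop(columns="last_ts")
)
```

Pinning `as_of` to an explicit snapshot date matters: computing recency against "now" makes the feature unreproducible between runs.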

Exploratory data analysis checklist for tabular ML projects

My EDA is opinionated because it has to answer modeling questions quickly. I care about label balance, leakage candidates, missingness patterns, monotonic relationships, and whether categorical levels explode in production. A repeatable checklist prevents me from skipping the boring checks that catch the expensive mistakes.
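A few of those checks compress into a small helper. This is a sketch with illustrative column names, covering label balance, missingness, and categorical cardinality (a proxy for level-explosion risk):

```python
import pandas as pd

def eda_summary(df, target):
    """First-pass answers: label balance, missingness fraction per
    column, and cardinality of object-typed (categorical) columns."""
    return {
        "label_balance": df[target].value_counts(normalize=True).to_dict(),
        "missing_frac": df.isna().mean().to_dict(),
        "cardinality": {c: df[c].nunique()
                        for c in df.select_dtypes(include="object")},
    }

# Illustrative toy frame.
df = pd.DataFrame({
    "y": [0, 0, 0, 1],
    "age": [34, None, 29, 41],
    "city": ["ber", "ber", "muc", "ham"],
})
summary = eda_summary(df, "y")
```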

Interactive Plotly figures for exploratory stakeholder reviews

Static plots are fine for papers, but product and business reviews often benefit from interactive filtering and hover details. I use Plotly when I need fast exploratory dashboards without spinning up a full app. It is especially useful for cohort analysis, where reviewers want to drill into segments themselves.

Statistical visualizations for distribution and drift analysis

I use distribution plots to decide whether a feature is stable enough to model, whether it needs transformation, or whether data drift is already happening. Seaborn makes it easy to compare classes, cohorts, or time windows. The visual check usually catches distribution shifts before a formal drift metric flags them.

Matplotlib and Seaborn defaults that make charts publication ready

I spend a few minutes standardizing plotting defaults before I start analysis. Better typography, clear labels, and consistent palette choices reduce review cycles and improve notebook readability. Charts should explain themselves without requiring a verbal walkthrough.
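A sketch of that setup cell; the specific values are personal preferences, not requirements:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this in a notebook
import matplotlib.pyplot as plt
import seaborn as sns

# Theme first (it overwrites rcParams), then targeted overrides.
sns.set_theme(style="whitegrid", palette="colorblind")
plt.rcParams.update({
    "figure.figsize": (8, 4.5),
    "figure.dpi": 120,
    "axes.titlesize": 13,
    "axes.labelsize": 11,
    "axes.spines.top": False,
    "axes.spines.right": False,
    "legend.frameon": False,
})

# Every chart then inherits the defaults; only content-specific
# labels remain per figure.
fig, ax = plt.subplots()
ax.plot([0, 1, 2], [1, 3, 2])
ax.set(title="Weekly active users", xlabel="Week", ylabel="Users (k)")
```

Ordering matters: `sns.set_theme` resets rcParams, so personal overrides have to come after it.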

Linear algebra patterns for similarity and projection tasks

A lot of machine learning reduces to linear algebra with better tooling. Dot products, norms, matrix multiplication, and projections show up in recommendation, embeddings, PCA, and optimization. I keep the implementation small and testable so it stays readable when the math gets denser.
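Two of the workhorses as small, testable functions: cosine similarity (the normalized dot product behind embedding search) and orthogonal projection (the core step in PCA and least squares):

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of unit vectors; standard embedding similarity."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def project_onto(v, u):
    """Orthogonal projection of v onto the line spanned by u."""
    return (v @ u) / (u @ u) * u

v = np.array([3.0, 4.0])
u = np.array([1.0, 0.0])
sim = cosine_similarity(v, u)   # 3 / 5 = 0.6
proj = project_onto(v, u)       # [3, 0]
residual = v - proj             # orthogonal to u by construction
```

The orthogonality of the residual is exactly the property a unit test should pin down; it holds for any `v` and nonzero `u`.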

NumPy broadcasting for vectorized feature engineering

Good NumPy code replaces Python loops with array semantics that are easier to optimize and easier to benchmark. Broadcasting is the feature that makes those transformations elegant. I rely on it for normalization, distance calculations, and matrix-friendly feature construction.
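Both of those uses in one sketch: column-wise standardization (a 2-D array against a 1-D row of statistics) and pairwise distances via an inserted axis, with no Python loops:

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 240.0],
              [3.0, 280.0]])

# Column-wise z-scores: shape (3, 2) minus shape (2,) broadcasts
# the per-column mean and std across every row.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Pairwise squared distances: (3, 1, 2) - (1, 3, 2) broadcasts to
# (3, 3, 2); summing over the last axis yields the (3, 3) matrix.
diff = X[:, None, :] - X[None, :, :]
sq_dists = (diff ** 2).sum(axis=-1)
```

The `None` (i.e. `np.newaxis`) indexing is the whole trick: it inserts a length-1 axis so the broadcasting rules line the arrays up.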

Time series resampling and rolling windows in pandas

For operational metrics and forecasting features, I standardize timestamps first and then resample into stable windows. Rolling statistics like 7D means, lagged deltas, and volatility bands are easy wins for exploratory analysis. I avoid mixing timezone-aware and naive timestamps, which is a common source of silent off-by-one-day bugs.
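A sketch of the pipeline on a synthetic hourly metric: pin the timezone, resample to daily, then derive the rolling and lagged features:

```python
import numpy as np
import pandas as pd

# Hourly metric pinned to UTC up front, so timezone-aware and naive
# timestamps never mix downstream.
idx = pd.date_range("2024-01-01", periods=24 * 14, freq="h", tz="UTC")
ts = pd.Series(np.arange(len(idx), dtype=float), index=idx)

daily = ts.resample("D").mean()      # stable daily windows
roll7 = daily.rolling("7D").mean()   # trailing 7-day mean
lag1_delta = daily.diff(1)           # day-over-day change
```

The `"7D"` time-based window is deliberate: it stays correct even when days are missing, where an integer `rolling(7)` would silently span gaps.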

Merging datasets safely with join keys and validation

Merges are where silent data corruption often begins. I prefer explicit key audits, join cardinality validation, and indicator columns when investigating row loss or duplication. In production analytics, proving that a join is one_to_one or many_to_one is worth the extra validation step.
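Both safeguards in one sketch, on hypothetical orders and users tables: `validate` asserts the expected cardinality at merge time, and `indicator` exposes rows that failed to match:

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2, 3],
                      "plan": ["free", "pro", "pro"]})
orders = pd.DataFrame({"order_id": [10, 11, 12],
                       "user_id": [1, 1, 9],   # 9 has no user row
                       "total": [20.0, 5.0, 7.5]})

merged = orders.merge(
    users,
    on="user_id",
    how="left",
    validate="many_to_one",  # raises MergeError if user_id repeats in users
    indicator=True,          # adds _merge: both / left_only / right_only
)

# Orders whose key found no user row -- candidates for a data audit.
orphans = merged[merged["_merge"] == "left_only"]
```

The `validate` check is cheap insurance: a fan-out that would silently duplicate rows becomes a loud exception instead.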

GroupBy aggregations and pivot tables for business reporting

I reach for groupby when I need trustworthy aggregates that can power dashboards or analytical reports. Clear aggregation naming matters because these outputs frequently get joined back into feature tables or exported to BI systems. pivot_table is useful when the report needs a wide, human-readable layout.
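Both patterns on an illustrative sales table: named aggregation for explicit downstream column names, then pivot_table for the wide reporting view:

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["na", "na", "eu", "eu", "eu"],
    "month": ["jan", "feb", "jan", "jan", "feb"],
    "revenue": [100.0, 120.0, 80.0, 40.0, 90.0],
})

# Named aggregation: output columns are explicit, so a later join or
# BI export never has to guess what "revenue" means.
summary = sales.groupby("region").agg(
    revenue_total=("revenue", "sum"),
    order_count=("revenue", "size"),
)

# Wide layout for reporting: regions as rows, months as columns.
wide = sales.pivot_table(index="region", columns="month",
                         values="revenue", aggfunc="sum", fill_value=0.0)
```

`fill_value` keeps the wide table dense, which matters when it is exported directly to a spreadsheet or dashboard.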

Cleaning missing values and normalizing messy CSV exports

Real data arrives dirty. I usually start with missing-value audits, duplicate removal, explicit type conversion, and canonical text cleanup. The trick is to make each cleanup rule reproducible rather than burying it in notebook state. I prefer small, composable cleaning functions over one monolithic script.
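A sketch of that shape on a made-up messy export: one function, each rule on its own line, operating on a copy so the raw frame survives for auditing:

```python
import pandas as pd

# Hypothetical messy CSV export: stray whitespace, mixed case,
# unparseable dates, stringly-typed numbers, and a null key.
raw = pd.DataFrame({
    "email": [" Ana@X.com", "ana@x.com ", "bob@y.com", None],
    "signup": ["2024-01-05", "2024-01-05", "not a date", "2024-02-01"],
    "spend": ["10.5", "10.5", "3", None],
})

def clean(df):
    """Every rule is explicit and ordered, so the run is reproducible."""
    out = df.copy()
    out["email"] = out["email"].str.strip().str.lower()   # canonical text
    out["signup"] = pd.to_datetime(out["signup"], errors="coerce")
    out["spend"] = pd.to_numeric(out["spend"], errors="coerce")
    out = out.dropna(subset=["email"])                    # key must exist
    out = out.drop_duplicates(subset=["email"])           # after canonicalizing
    return out

tidy = clean(raw)
```

The ordering is part of the contract: deduplication must run after text canonicalization, or `" Ana@X.com"` and `"ana@x.com "` survive as two rows.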