Train-test split and stratified cross-validation done properly

Evaluation goes wrong when data splitting is treated like boilerplate. I stratify imbalanced targets, guard time order when necessary, and make sure preprocessing lives inside cross-validation. This is the difference between a model that looks good in a notebook and one that holds up on unseen data.
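A minimal sketch of the pattern, using synthetic data: the holdout split is stratified so the class ratio survives, and the scaler sits inside the pipeline so each CV fold fits preprocessing on its own training portion only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Imbalanced synthetic target: roughly 10% positives.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# stratify=y keeps the class ratio identical in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Preprocessing lives inside the pipeline, so cross_val_score
# never lets test folds influence the scaler's fit.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")
print(scores.mean())
```

Fitting the scaler outside the loop would leak test-fold statistics into training; the pipeline makes that mistake impossible.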

Scaling and normalization choices for different model families

Not every model cares about scale, but enough of them do that I keep scaling explicit. Linear models, SVMs, neural nets, and distance-based methods all benefit from well-behaved inputs. I prefer putting scalers inside the pipeline so training and inference apply identical transformations.

Encoding categorical variables without creating leakage

Categoricals are where good intentions become leakage. I use one-hot encoding for low-cardinality stable fields, ordinal encoders only when order is real, and frequency or target encoders with strict cross-validation boundaries. The encoder strategy should match each column's cardinality and stability, not a single dataset-wide default.
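A leak-free sketch of the one-hot case, with illustrative column names: the encoder lives inside the pipeline so each CV fold learns categories only from its own training rows, and handle_unknown="ignore" keeps inference safe when a new level appears in production.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical subscription dataset.
df = pd.DataFrame({
    "plan": ["free", "pro", "free", "team", "pro", "free"] * 20,
    "region": ["us", "eu", "eu", "us", "apac", "us"] * 20,
    "tenure_days": range(120),
})
y = pd.Series([0, 1] * 60)

# Unseen categories at inference time encode as all-zeros
# instead of raising, thanks to handle_unknown="ignore".
pre = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["plan", "region"])],
    remainder="passthrough",
)
model = make_pipeline(pre, LogisticRegression(max_iter=1000))
scores = cross_val_score(model, df, y, cv=5)
```

Target and frequency encoders need the same discipline, just with more at stake: fit them per fold, never on the full dataset.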

Feature engineering for recency, frequency, and monetary behavior

Tabular models improve fast when you encode behavior rather than raw events. Recency, frequency, and monetary aggregates are durable baseline features for retention, fraud, and conversion use cases. I usually build them in pure pandas first, then port the logic to production once the definitions settle.
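The pandas version is a single groupby over an event table. A minimal sketch with made-up order data, computing all three aggregates against a fixed snapshot date:

```python
import pandas as pd

# Hypothetical order events.
events = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "order_ts": pd.to_datetime([
        "2024-01-05", "2024-03-01", "2024-02-10",
        "2024-02-20", "2024-03-02", "2024-01-15",
    ]),
    "amount": [50.0, 30.0, 20.0, 25.0, 40.0, 10.0],
})
snapshot = pd.Timestamp("2024-03-03")  # "as of" date for all features

rfm = events.groupby("customer_id").agg(
    recency_days=("order_ts", lambda s: (snapshot - s.max()).days),
    frequency=("order_ts", "size"),
    monetary=("amount", "sum"),
)
print(rfm)
```

Pinning the snapshot date explicitly matters: computing recency against "now" makes the feature unreproducible and invites leakage when backfilling training data.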

Exploratory data analysis checklist for tabular ML projects

My EDA is opinionated because it has to answer modeling questions quickly. I care about label balance, leakage candidates, missingness patterns, monotonic relationships, and whether categorical levels explode in production. A repeatable checklist prevents the boring-but-critical checks from being skipped.
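Part of the checklist reduces to one small helper I rerun on every dataset; a sketch with a toy frame (the helper name is mine, not a library function):

```python
import pandas as pd

def eda_summary(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, missingness, and cardinality."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_frac": df.isna().mean(),
        "n_unique": df.nunique(dropna=True),
    })

df = pd.DataFrame({
    "age": [25, 31, None, 40],
    "plan": ["free", "pro", "pro", None],
    "churned": [0, 1, 0, 1],
})

print(df["churned"].value_counts(normalize=True))  # label balance first
summary = eda_summary(df)
print(summary)
```

High n_unique on a categorical column is my first leakage and production-explosion flag; high missing_frac decides imputation strategy before any modeling starts.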

Interactive Plotly figures for exploratory stakeholder reviews

Static plots are fine for papers, but product and business reviews often benefit from interactive filtering and hover details. I use Plotly when I need fast exploratory dashboards without spinning up a full app. It is especially useful for cohort analysis, where stakeholders want to drill into segments themselves.

Statistical visualizations for distribution and drift analysis

I use distribution plots to decide whether a feature is stable enough to model, whether it needs transformation, or whether data drift is already happening. Seaborn makes it easy to compare classes, cohorts, or time windows. The visual check usually catches drift before summary statistics do.

Matplotlib and Seaborn defaults that make charts publication ready

I spend a few minutes standardizing plotting defaults before I start analysis. Better typography, clear labels, and consistent palette choices reduce review cycles and improve notebook readability. Charts should explain themselves without requiring a live walkthrough.
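One rcParams update at the top of a notebook covers most of it; these particular values are my preferences, not requirements:

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend
import matplotlib.pyplot as plt

# Applied once per session; every later figure inherits them.
plt.rcParams.update({
    "figure.figsize": (8, 4.5),
    "figure.dpi": 120,
    "axes.titlesize": 13,
    "axes.labelsize": 11,
    "axes.spines.top": False,    # drop chartjunk borders
    "axes.spines.right": False,
    "axes.grid": True,
    "grid.alpha": 0.3,           # visible but unobtrusive gridlines
})

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [2, 4, 3])
ax.set(title="Weekly signups", xlabel="Week", ylabel="Signups")
```

Putting this in a shared module (or a matplotlib style file) keeps every notebook in a project visually consistent for free.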

Linear algebra patterns for similarity and projection tasks

A lot of machine learning reduces to linear algebra with better tooling. Dot products, norms, matrix multiplication, and projections show up in recommendation, embeddings, PCA, and optimization. I keep the implementation small and testable so it stays easy to verify.
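The two workhorses, written small enough to unit-test by hand: cosine similarity from dot products and norms, and the orthogonal projection of one vector onto the span of another.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = (a . b) / (||a|| ||b||)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def project_onto(v: np.ndarray, u: np.ndarray) -> np.ndarray:
    """Orthogonal projection of v onto the line spanned by u:
    proj_u(v) = ((v . u) / (u . u)) * u."""
    return (v @ u) / (u @ u) * u

v = np.array([3.0, 4.0])
u = np.array([1.0, 0.0])
print(cosine_similarity(v, u))  # 0.6
print(project_onto(v, u))       # [3. 0.]
```

Checking these against hand-computable cases like the 3-4-5 triangle above is exactly the kind of test that keeps the linear algebra honest as the codebase grows.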

NumPy broadcasting for vectorized feature engineering

Good NumPy code replaces Python loops with array semantics that are easier to optimize and easier to benchmark. Broadcasting is the feature that makes those transformations elegant. I rely on it for normalization, distance calculations, and matrix-friendly feature transforms.
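Both of those use cases in a few lines, with the broadcast shapes spelled out in comments:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

# Column-wise z-scores: (100, 3) minus (3,) broadcasts the
# per-column mean and std across every row, no loop needed.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Pairwise squared distances: inserting length-1 axes makes
# (100, 1, 3) - (1, 100, 3) broadcast to a (100, 100, 3) cube
# of coordinate differences, summed over the last axis.
diff = X[:, None, :] - X[None, :, :]
D2 = (diff ** 2).sum(axis=-1)
```

The distance cube trades memory for clarity (it materializes n^2 * d floats), so for large n the ||a||^2 + ||b||^2 - 2ab matrix-multiplication identity is the better-scaling variant.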

Time series resampling and rolling windows in pandas

For operational metrics and forecasting features, I standardize timestamps first and then resample into stable windows. Rolling statistics like 7D means, lagged deltas, and volatility bands are easy wins for exploratory analysis. I avoid mixing timezone-naive and timezone-aware timestamps, which is a reliable source of silent window misalignment.
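A sketch on a synthetic daily order count, with the index pinned to UTC up front so every window boundary is unambiguous:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# 90 days of simulated order counts on a timezone-aware UTC index.
idx = pd.date_range("2024-01-01", periods=90, freq="D", tz="UTC")
daily = pd.Series(rng.poisson(100, size=90), index=idx, name="orders")

weekly = daily.resample("7D").sum()        # stable, non-overlapping bins
rolling_mean = daily.rolling("7D").mean()  # trailing 7-day window per day
lag_delta = daily - daily.shift(7)         # week-over-week change
```

Note the difference: resample produces one row per 7-day bin, while a time-based rolling window produces one trailing statistic per original row; the first seven lag deltas are NaN by construction, which is the honest answer rather than a filled-in zero.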

Merging datasets safely with join keys and validation

Merges are where silent data corruption often begins. I prefer explicit key audits, join cardinality validation, and indicator columns when investigating row loss or duplication. In production analytics, proving that a join is one_to_one or many_to_one is worth a line of validation.
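Both guards in one merge call, on a toy orders/customers pair where one order has no matching customer: validate raises a MergeError the moment the key relationship breaks, and indicator makes the unmatched rows queryable instead of invisible.

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, 10, 11, 12],  # customer 12 has no master record
})
customers = pd.DataFrame({
    "customer_id": [10, 11],
    "segment": ["pro", "free"],
})

# validate="many_to_one" asserts customers has unique keys;
# indicator=True adds a _merge column naming each row's source.
merged = orders.merge(
    customers, on="customer_id", how="left",
    validate="many_to_one", indicator=True,
)
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched["order_id"].tolist())  # -> [4]
```

If customers ever gained a duplicate customer_id, the merge would fail loudly at the validate check instead of silently fanning out rows downstream.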