machine-learning

Scaling and normalization choices for different model families

Not every model cares about scale, but enough of them do that I keep scaling explicit. Linear models, SVMs, neural nets, and distance-based methods all benefit from well-behaved inputs. I prefer putting scalers inside the pipeline so train and inference apply the same transformations.
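A minimal sketch of the pipeline idea, using scikit-learn on synthetic data (the dataset and model choices here are just placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# The scaler lives inside the pipeline, so during cross-validation each
# fold fits the scaler on its own training split only -- the same fitted
# transform is then reused at inference time, with no leakage.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5)
print(f"mean accuracy: {scores.mean():.3f}")
```

Because the scaler is a pipeline step, `model.fit` and `model.predict` handle scaling automatically; there is no separate transform object to keep in sync between training and serving.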

Exploratory data analysis checklist for tabular ML projects

My EDA is opinionated because it has to answer modeling questions quickly. I care about label balance, leakage candidates, missingness patterns, monotonic relationships, and whether categorical levels explode in production. A repeatable checklist keeps those checks from being skipped.
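A few of those checks can be made repeatable as a small pandas helper; this is an illustrative sketch (the `eda_checklist` function and toy frame are my own), covering label balance, missingness, and categorical cardinality:

```python
import numpy as np
import pandas as pd

def eda_checklist(df: pd.DataFrame, target: str) -> dict:
    """Repeatable answers to a few core modeling questions."""
    return {
        # Label balance: class proportions of the target column.
        "label_balance": df[target].value_counts(normalize=True).to_dict(),
        # Missingness: fraction of nulls per column, worst first.
        "missing_frac": df.isna().mean().sort_values(ascending=False).to_dict(),
        # Cardinality: unique levels per string column (explosion risk).
        "cardinality": {c: df[c].nunique() for c in df.select_dtypes("object")},
    }

df = pd.DataFrame({
    "label": [0, 0, 0, 1],
    "city": ["a", "b", "c", "a"],
    "income": [1.0, np.nan, 3.0, 4.0],
})
print(eda_checklist(df, "label"))
```

Leakage candidates and monotonic relationships need more judgment, but even this subset turns "did we check?" into a function call that runs on every new dataset.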

Bayesian optimization with Optuna for efficient model tuning

When the search space is wide, Optuna gives me better signal per compute dollar than brute-force sweeps. It makes it easy to define conditional search spaces, prune bad trials early, and track the best trial artifacts. I especially like it for gradient boosting models.

Data validation contracts with Pandera for pipeline reliability

I use schema validation to stop bad data before it poisons training or inference. Pandera lets me express expectations around types, nullability, ranges, and uniqueness in code that can run in CI or orchestration jobs. This catches upstream breakage e

Baseline classifiers in scikit-learn for fast benchmark setting

I like setting a few strong baselines before chasing complexity. A regularized logistic regression, a random forest, and a gradient boosting model usually tell me whether the problem is linearly separable, non-linear, or data-limited. Good baseline di