scikit-learn

Regression workflows with linear, ridge, lasso, and elastic net

For numeric targets I usually start simple and make regularization earn its keep. Ridge is stable, Lasso helps with sparsity, and ElasticNet is a practical compromise when correlated features exist. The main goal is not just minimizing RMSE but understanding which features actually carry the signal.
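A minimal sketch of that comparison, on synthetic data (the dataset and alpha values are illustrative, not tuned):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import cross_val_score

# Synthetic regression data with more features than true signals
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

models = {
    "ols": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
    "enet": ElasticNet(alpha=0.1, l1_ratio=0.5),
}

results = {}
for name, model in models.items():
    # Negated so scikit-learn can maximize it; flip the sign back for RMSE
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_root_mean_squared_error")
    results[name] = -scores.mean()
```

Comparing cross-validated RMSE across all four in one loop makes it obvious whether regularization is helping or just along for the ride.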

Encoding categorical variables without creating leakage

Categoricals are where good intentions become leakage. I use one-hot encoding for low-cardinality stable fields, ordinal encoders only when order is real, and frequency or target encoders with strict cross-validation boundaries. The encoder strategy should be decided per column and fit only inside training folds.
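The safest way to enforce that boundary is to put the encoder inside the pipeline, so each cross-validation fold refits it on training data only. A sketch with a toy frame (column names are illustrative):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data; "city" and "plan" stand in for real categorical fields
df = pd.DataFrame({
    "city": ["a", "b", "a", "c"] * 25,
    "plan": ["basic", "pro"] * 50,
    "y": [0, 1] * 50,
})
X, y = df[["city", "plan"]], df["y"]

pre = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["city", "plan"]),
])
clf = Pipeline([("pre", pre), ("model", LogisticRegression())])

# The encoder is refit per fold, so no category statistics leak across splits
scores = cross_val_score(clf, X, y, cv=5)
```

`handle_unknown="ignore"` also keeps inference from crashing on categories the training folds never saw.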

Hyperparameter tuning with GridSearchCV and randomized search

Hyperparameter search should be targeted, not theatrical. I usually combine a strong baseline, a compact search space, and a metric aligned with business cost. GridSearchCV is good for interpretable sweeps; randomized search is better when the space grows beyond a few dimensions.
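The two side by side, on a toy problem (the `C` ranges are illustrative, not recommendations):

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=300, random_state=0)

# Compact, interpretable sweep: four explicit values
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    {"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
grid.fit(X, y)

# Randomized sweep: sample from a log-uniform distribution instead
rand = RandomizedSearchCV(LogisticRegression(max_iter=1000),
                          {"C": loguniform(1e-3, 1e2)},
                          n_iter=10, cv=5, random_state=0)
rand.fit(X, y)
```

With a single parameter the grid is fine; once several parameters interact, sampling the same budget randomly usually covers the space better than a coarse grid.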

Serving scikit-learn models behind a FastAPI prediction API

Deployment should not rewrite the feature logic from scratch. I expose trained pipelines behind FastAPI so the exact preprocessing and estimator objects travel together. Strong request schemas and explicit model versioning keep this boring in the right way.

Text vectorization with TF-IDF for strong classical baselines

Before I fine-tune transformers, I almost always try a TF-IDF baseline. It is fast, interpretable, and often surprisingly competitive for moderate text classification tasks. If a linear model over sparse features is already good enough, that is usually what I ship.
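The whole baseline fits in a few lines (the texts and labels here are toy placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus standing in for a real labeled dataset
texts = ["great product fast shipping", "terrible broken on arrival",
         "love it works great", "awful waste of money"]
labels = [1, 0, 1, 0]

baseline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("model", LogisticRegression()),
]).fit(texts, labels)

preds = baseline.predict(texts)
```

A bonus of the linear model: its coefficients map directly back to n-grams, so you can read off which phrases drive each class before deciding a transformer is worth the cost.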

ColumnTransformer pipelines that keep preprocessing honest

I push nearly all preprocessing into a Pipeline so training and inference paths share exactly the same logic. ColumnTransformer is the workhorse here because real-world tables mix numeric, categorical, boolean, and text fields. It gives you reproducible, inspectable transformations instead of ad hoc scripts.
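A sketch of a mixed-type table routed through one ColumnTransformer (the frame and column names are made up for illustration):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy mixed-type table, including a missing numeric value
df = pd.DataFrame({
    "income": [40_000, 52_000, None, 61_000] * 10,
    "region": ["n", "s", "e", "w"] * 10,
    "active": [True, False, True, True] * 10,
    "y": [0, 1, 0, 1] * 10,
})

# Each column family gets its own sub-pipeline
numeric = Pipeline([("impute", SimpleImputer()), ("scale", StandardScaler())])
pre = ColumnTransformer([
    ("num", numeric, ["income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
    ("bool", "passthrough", ["active"]),
])

model = Pipeline([("pre", pre), ("clf", RandomForestClassifier(random_state=0))])
model.fit(df.drop(columns="y"), df["y"])
```

At inference time you hand the fitted pipeline a raw frame with the same columns, and imputation, scaling, and encoding all replay exactly as trained.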

Train/test split and stratified cross-validation done properly

Evaluation goes wrong when data splitting is treated like boilerplate. I stratify imbalanced targets, guard time order when necessary, and make sure preprocessing lives inside cross-validation. This is the difference between a model that looks good in a notebook and one that holds up on new data.
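Those three habits in one sketch, on a deliberately imbalanced toy dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Imbalanced toy problem: roughly 90/10 class split
X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=0)

# Stratify the holdout so the minority class appears in both splits
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

# Scaling lives inside the pipeline, so each CV fold fits the scaler
# on its own training portion only -- no peeking at validation data
pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X_tr, y_tr, cv=cv)
```

If the data were time-ordered, `TimeSeriesSplit` would replace `StratifiedKFold` here; the pipeline discipline stays the same.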

Baseline classifiers in scikit-learn for fast benchmark setting

I like setting a few strong baselines before chasing complexity. A regularized logistic regression, a random forest, and a gradient boosting model usually tell me whether the problem is linearly separable, non-linear, or data-limited. Good baseline discipline also tells me how much any later complexity is actually buying.
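The trio scored with one loop (synthetic data and default hyperparameters, just to show the benchmark shape):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_informative=8, random_state=0)

baselines = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(random_state=0),
    "gboost": GradientBoostingClassifier(random_state=0),
}

# Cross-validated accuracy for each baseline, on identical folds
results = {name: cross_val_score(m, X, y, cv=5).mean()
           for name, m in baselines.items()}
```

If the linear model is already within a point or two of the tree ensembles, the problem is probably close to linear and further complexity needs a strong justification.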