Natural language processing with spaCy pipelines and custom rules

I like spaCy for production NLP because it balances performance, ergonomics, and deployability. It is especially good for entity extraction, rule-based matching, and clean token-level processing. I often pair learned models with explicit match patterns so that known-critical terms are caught deterministically.
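A minimal sketch of the rule-based side, using a blank English pipeline (tokenizer only, so no model download is needed); the VERSION pattern and example sentence are illustrative:

```python
import spacy
from spacy.matcher import Matcher

# Blank pipeline: just the tokenizer, enough for token-attribute matching.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Match "version" followed by a number-like token, e.g. "version 3.7".
matcher.add("VERSION", [[{"LOWER": "version"}, {"LIKE_NUM": True}]])

doc = nlp("We shipped version 3.7 last week.")
matches = matcher(doc)
for match_id, start, end in matches:
    print(nlp.vocab.strings[match_id], doc[start:end].text)
```

In a real pipeline this Matcher would sit next to a statistical NER component, with the rules handling the vocabulary you cannot afford to miss.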

PCA and t-SNE for dimensionality reduction and inspection

I use dimensionality reduction both as a modeling tool and as an investigative lens. PCA is good for compression and signal inspection; t-SNE is useful when I need to see whether latent clusters or label separation exist at all. I never present those projections as conclusions on their own; they are prompts for quantitative follow-up.
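A sketch of that two-step workflow on the digits dataset: PCA to check retained variance, then t-SNE on the PCA output for visual inspection. The component count, perplexity, and subsample size are illustrative defaults, not recommendations:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]  # subsample just to keep this sketch fast

pca = PCA(n_components=50, random_state=0)
X_pca = pca.fit_transform(X)
print(f"variance kept by 50 PCs: {pca.explained_variance_ratio_.sum():.2f}")

# t-SNE is for looking, not modeling: distances and densities in the
# embedding are not meaningful quantities.
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_pca)
print(X_2d.shape)
```

Running t-SNE on the PCA projection rather than raw features is a common speed and denoising trick; the 2-D coordinates feed a scatter plot colored by label.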

Confusion matrix diagnostics for threshold selection

Thresholds are policy decisions disguised as numbers. I use confusion matrices to make those tradeoffs concrete for stakeholders: how many risky accounts we block, how many fraud attempts slip through, and how much manual review load is created. This framing turns threshold selection into a shared business decision rather than a hidden modeling default.
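A sketch of that conversation in code: sweep a few candidate thresholds over held-out probabilities and report each cell of the confusion matrix in operational terms. The synthetic data and the threshold values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Imbalanced toy problem: ~10% positives standing in for fraud.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

for threshold in (0.2, 0.5, 0.8):
    tn, fp, fn, tp = confusion_matrix(y_te, proba >= threshold).ravel()
    # fp ~ manual review load created; fn ~ fraud that slips through
    print(f"t={threshold}: blocked={tp + fp}, missed={fn}, review_load={fp}")
```

Presenting the sweep as "blocked / missed / review load" instead of raw TP/FP counts is what makes the tradeoff legible to non-modelers.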

Classification metrics beyond accuracy for imbalanced problems

Accuracy is a bad comfort metric when the positive class is rare. I care more about precision, recall, PR AUC, calibration, and how thresholding changes operational workload. The right metric depends on the cost of false negatives versus false positives.
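A small sketch on synthetic data with 5% positives showing the metrics I actually look at; the dataset and model are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# ~5% positives: accuracy would look fine here simply because
# negatives dominate, which is exactly the trap.
X, y = make_classification(n_samples=4000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]
pred = clf.predict(X_te)

ap = average_precision_score(y_te, proba)  # PR AUC, threshold-free
print("precision:", precision_score(y_te, pred))
print("recall:   ", recall_score(y_te, pred))
print("PR AUC:   ", ap)
```

PR AUC (`average_precision_score`) is the threshold-free summary; precision and recall at the default 0.5 cutoff are where the operational conversation starts.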

Bayesian optimization with Optuna for efficient model tuning

When the search space is wide, Optuna gives me better signal per compute dollar than brute-force sweeps. It is easy to define conditional search spaces, prune bad trials early, and track the best trial artifacts. I especially like it for gradient boosting models, where learning rate, tree depth, and regularization interact in ways a grid cannot cover efficiently.

Hyperparameter tuning with GridSearchCV and randomized search

Hyperparameter search should be targeted, not theatrical. I usually combine a strong baseline, a compact search space, and a metric aligned with business cost. GridSearchCV is good for interpretable sweeps; randomized search is better when the space grows large or continuous.
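A sketch of both styles on the same pipeline: a compact grid over `C`, then a randomized search over a continuous log-uniform range. The search spaces are illustrative:

```python
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))

# Interpretable sweep: four values, easy to reason about afterwards.
grid = GridSearchCV(pipe, {"logisticregression__C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X, y)
print("grid best:", grid.best_params_)

# Continuous space: sample instead of enumerating.
rand = RandomizedSearchCV(
    pipe, {"logisticregression__C": loguniform(1e-3, 1e2)},
    n_iter=20, cv=5, random_state=0,
)
rand.fit(X, y)
print("random best:", rand.best_params_)
```

Note the `logisticregression__` prefix: searching over a pipeline step means preprocessing is refit inside every fold, which is the point.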

ColumnTransformer pipelines that keep preprocessing honest

I push nearly all preprocessing into a Pipeline so training and inference paths share exactly the same logic. ColumnTransformer is the workhorse here because real-world tables mix numeric, categorical, boolean, and text fields. It gives you reproducible, leakage-resistant preprocessing packaged in one serializable object.
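A sketch on a toy frame; the column names and values are illustrative, but the shape of the pipeline is the pattern I reuse:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [25, 32, None, 51],
    "income": [40_000, 60_000, 52_000, None],
    "plan": ["basic", "pro", "basic", "pro"],
    "churned": [0, 1, 0, 1],
})
numeric, categorical = ["age", "income"], ["plan"]

# Each column group gets its own sub-pipeline; everything is fit
# together, so inference applies exactly the training-time transforms.
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

model = Pipeline([("prep", preprocess),
                  ("clf", LogisticRegression(max_iter=1000))])
model.fit(df[numeric + categorical], df["churned"])
preds = model.predict(df[numeric + categorical])
print(preds)
```

`handle_unknown="ignore"` on the encoder is the small choice that keeps inference from crashing when a new category shows up in production.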

Clustering with KMeans, DBSCAN, and hierarchical approaches

Unsupervised work gets much better when you compare clustering assumptions instead of treating one algorithm as truth. KMeans prefers spherical clusters, DBSCAN handles noise, and hierarchical clustering is useful when you want a multi-resolution view of the data.
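A sketch of that comparison on deliberately non-spherical data (two moons), where the algorithms' assumptions visibly diverge; `eps` and the cluster counts are illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)
X = StandardScaler().fit_transform(X)

# KMeans tends to cut each moon in half (spherical bias); DBSCAN can
# follow the curved shapes and labels outliers as -1.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
hc = AgglomerativeClustering(n_clusters=2).fit_predict(X)

for name, labels in [("kmeans", km), ("dbscan", db), ("agglo", hc)]:
    print(name, np.unique(labels))
```

The useful output is not the labels themselves but the disagreement between methods: when three different assumption sets give three different partitions, that tells you something about the data's structure.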

Regression workflows with linear, ridge, lasso, and elastic net

For numeric targets I usually start simple and make regularization earn its keep. Ridge is stable, Lasso helps with sparsity, and ElasticNet is a practical compromise when correlated features exist. The main goal is not just minimizing RMSE but understanding which features carry signal once regularization is applied.

Baseline classifiers in scikit-learn for fast benchmark setting

I like setting a few strong baselines before chasing complexity. A regularized logistic regression, a random forest, and a gradient boosting model usually tell me whether the problem is linearly separable, non-linear, or data-limited. Good baseline discipline keeps later complexity honest about what it actually adds.
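A sketch of that baseline trio on a standard dataset; the estimators and their settings are illustrative defaults:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

baselines = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000)),
    "rf": RandomForestClassifier(n_estimators=200, random_state=0),
    "gbm": GradientBoostingClassifier(random_state=0),
}
results = {}
for name, model in baselines.items():
    results[name] = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name:7s} ROC AUC={results[name]:.3f}")
# If logreg is already close to the ensembles, the signal is largely linear
# and added model complexity has little room to pay off.
```

The diagnostic reading matters more than the leaderboard: a big linear-vs-ensemble gap argues for feature interactions; no gap argues for better data before better models.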

Train test split and stratified cross validation done properly

Evaluation goes wrong when data splitting is treated like boilerplate. I stratify imbalanced targets, guard time order when necessary, and make sure preprocessing lives inside cross-validation. This is the difference between a model that looks good in a notebook and one that holds up in production.
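A sketch of the split discipline on an imbalanced toy problem: a stratified hold-out, stratified CV on the remainder, and the scaler inside the pipeline so each fold fits preprocessing on its own training portion. The data and fold counts are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, weights=[0.85, 0.15], random_state=0)

# stratify=y keeps the 15% positive rate in both halves of the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Scaler inside the pipeline: cross_val_score refits it per fold,
# so no test-fold statistics leak into training.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X_tr, y_tr, cv=cv, scoring="f1")
print(f"CV F1: {scores.mean():.3f} +/- {scores.std():.3f}")

pipe.fit(X_tr, y_tr)
print(f"held-out F1: {f1_score(y_te, pipe.predict(X_te)):.3f}")
```

For time-ordered data I would swap `StratifiedKFold` for `TimeSeriesSplit`; the rest of the structure stays the same.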

Scaling and normalization choices for different model families

Not every model cares about scale, but enough of them do that I keep scaling explicit. Linear models, SVMs, neural nets, and distance-based methods all benefit from well-behaved inputs. I prefer putting scalers inside the pipeline so train and inference paths stay identical.
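A sketch of why it matters for distance-based methods: the same kNN model with and without a scaler, on data where one feature has been blown up to dominate raw Euclidean distance. The inflation factor is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X[:, 0] *= 1000.0  # simulate a feature measured on a much larger unit scale

# Unscaled: distances are effectively computed on one feature.
raw = cross_val_score(KNeighborsClassifier(), X, y, cv=5).mean()

# Scaler lives inside the pipeline, so CV refits it per fold and the
# exact same transform is applied at inference time.
scaled = cross_val_score(
    make_pipeline(StandardScaler(), KNeighborsClassifier()), X, y, cv=5).mean()
print(f"kNN raw:    {raw:.3f}")
print(f"kNN scaled: {scaled:.3f}")
```

Tree ensembles would shrug at the inflated feature; kNN, SVMs, and anything gradient-trained generally will not, which is why I keep the scaling decision explicit per model family.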