OpenCV image preprocessing for OCR and vision pipelines

A lot of computer vision performance comes from cleaner inputs rather than larger models. I use OpenCV for resizing, denoising, thresholding, and contour extraction when preparing images for OCR or downstream classification. These classical steps often improve accuracy more cheaply than swapping in a bigger model.

Geospatial analysis with GeoPandas for location intelligence

Location data becomes useful when spatial joins and distance-based features are handled correctly. GeoPandas is enough for many routing, service coverage, and market analysis tasks before you need heavier GIS infrastructure. I care about coordinate systems above all: joins and distances are only trustworthy once every layer shares an appropriate projected CRS.
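A sketch of a containment join of the kind described, assuming a projected CRS so coordinates behave like metres; the zone names, customer IDs, and EPSG code are invented for the example:

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Two hypothetical service zones in a projected CRS (units are metres).
zones = gpd.GeoDataFrame(
    {"zone": ["A", "B"]},
    geometry=[
        Polygon([(0, 0), (0, 10), (10, 10), (10, 0)]),
        Polygon([(10, 0), (10, 10), (20, 10), (20, 0)]),
    ],
    crs="EPSG:32633",
)
customers = gpd.GeoDataFrame(
    {"customer_id": [1, 2, 3]},
    geometry=[Point(2, 2), Point(15, 5), Point(25, 5)],
    crs="EPSG:32633",
)

# Spatial join: which zone, if any, contains each customer point?
joined = gpd.sjoin(customers, zones, how="left", predicate="within")
```

The `how="left"` keeps customers outside every zone in the result with a missing zone value, which is often exactly the coverage-gap signal the analysis is after.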

Regular expressions for extracting structured entities from raw text

Regex is not glamorous, but it remains one of the fastest ways to turn messy text into useful structured fields. I use it for IDs, dates, codes, and log fragments before reaching for heavier NLP. The important part is making patterns specific enough that they fail loudly on unexpected input rather than silently matching garbage.
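A small example of the pattern style I mean: named groups, explicit character classes, and a shape specific to the format at hand. The log line and field names are invented for illustration:

```python
import re

LOG_LINE = "2024-05-17 09:12:31 ERROR order_id=ORD-00492 user=u_7781 failed: timeout"

# Named groups keep the pattern self-documenting; tight character classes
# mean a malformed line yields no match instead of a half-right record.
PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) order_id=(?P<order_id>ORD-\d+)"
)

m = PATTERN.search(LOG_LINE)
record = m.groupdict() if m else {}
```

Treating "no match" as an explicit empty result, rather than an exception path, makes it easy to count and inspect the lines a pattern rejects.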

Web scraping pipelines with requests and BeautifulSoup

For lightweight data collection, I prefer reliable HTML parsing over brittle browser automation. That means stable headers, polite rate limiting, retries, and explicit extraction rules. If scraping becomes core infrastructure, then I graduate it into a proper pipeline with scheduling, monitoring, and caching.
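A sketch of that session setup plus one explicit extraction rule, parsing a fixed HTML snippet so the example makes no live request; the User-Agent string, retry counts, and CSS class are illustrative:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from bs4 import BeautifulSoup

# Polite session: a stable, identifiable User-Agent plus bounded retries
# with backoff on transient failures (values here are illustrative).
session = requests.Session()
session.headers.update({"User-Agent": "data-pipeline/1.0 (contact@example.com)"})
retry = Retry(total=3, backoff_factor=1.0, status_forcelist=[429, 500, 502, 503])
session.mount("https://", HTTPAdapter(max_retries=retry))

# Explicit extraction rule against a fixed snippet (no network call here).
html = '<ul><li class="item">alpha</li><li class="item">beta</li></ul>'
soup = BeautifulSoup(html, "html.parser")
items = [li.get_text(strip=True) for li in soup.select("li.item")]
```

Keeping the selector in one obvious place makes the inevitable "the site changed its markup" fix a one-line diff instead of an archaeology exercise.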

SQL window functions for feature extraction and behavioral ranking

A surprising amount of feature engineering is best done in SQL before Python ever runs. ROW_NUMBER, LAG, rolling windows, and partitioned aggregates are ideal for deriving customer behavior signals close to the source. I use SQL here when it reduces data movement and keeps the feature logic next to the tables it depends on.
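A sketch of ROW_NUMBER and LAG over a partitioned window, run against an in-memory SQLite table (SQLite has supported window functions since 3.25); the table and column names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (customer_id TEXT, order_date TEXT, amount REAL);
INSERT INTO orders VALUES
  ('c1', '2024-01-01', 10.0),
  ('c1', '2024-01-05', 25.0),
  ('c2', '2024-01-02', 40.0);
""")

# Per-customer order rank and spend delta versus the previous order.
rows = conn.execute("""
SELECT customer_id,
       order_date,
       ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) AS order_rank,
       amount - LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS amount_delta
FROM orders
ORDER BY customer_id, order_date
""").fetchall()
```

The NULL delta on each customer's first order is a feature in itself: it marks new customers without any extra logic.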

Great Expectations checks for dataset health before retraining

Before retraining, I want hard guarantees that the data feed still looks structurally sane. Great Expectations gives teams a shared validation language that analysts, ML engineers, and data engineers can all inspect. I use it to codify invariants that would otherwise live only in tribal knowledge.

Data validation contracts with Pandera for pipeline reliability

I use schema validation to stop bad data before it poisons training or inference. Pandera lets me express expectations around types, nullability, ranges, and uniqueness in code that can run in CI or orchestration jobs. This catches upstream breakage early, at the pipeline boundary where it is cheapest to fix.

Experiment tracking and model registry workflows with MLflow

If experiments matter, they should be searchable after the notebook is closed. MLflow gives me parameter tracking, metric history, artifact storage, and a lightweight model registry without much ceremony. It is one of the fastest ways to make a small team's experiments reproducible and auditable.

Serializing models with joblib, pickle, and ONNX tradeoffs

Model serialization is not just a file-format choice. It affects startup time, compatibility, portability, and security boundaries. I use joblib for common scikit-learn pipelines, reserve pickle for trusted internal workflows, and reach for ONNX when a model has to run outside the Python ecosystem.
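A sketch of the joblib path on a toy pipeline; the round-trip prediction check at the end is the part worth keeping in CI:

```python
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# Serialize preprocessing and estimator together, never the estimator alone.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X, y)

joblib.dump(pipe, "model.joblib")       # fine for trusted sklearn artifacts
restored = joblib.load("model.joblib")  # pickle-based: only load trusted files
```

The security caveat applies equally to joblib and pickle: both execute arbitrary code on load, which is exactly the boundary ONNX lets you avoid.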

Serving scikit-learn models behind a FastAPI prediction API

Deployment should not rewrite the feature logic from scratch. I expose trained pipelines behind FastAPI so the exact preprocessing and estimator objects travel together. Strong request schemas and explicit model versioning keep this boring in the right way.

Anomaly detection with isolation forest and robust thresholds

Anomaly detection is mostly about defining normal behavior well enough that deviations matter. I usually combine a model like IsolationForest with feature windows and operational thresholds that the business can interpret. Without that calibration, an anomaly detector is just an alert generator nobody trusts.
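A sketch pairing IsolationForest scores with a median/MAD threshold rather than a raw contamination cutoff; the 5-MAD multiplier is illustrative, not a recommendation:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(200, 2))          # "normal behavior" sample
outliers = np.array([[8.0, 8.0], [-9.0, 7.0]])    # two obvious deviations
X = np.vstack([normal, outliers])

# Fit on normal data only; lower score_samples values mean more anomalous.
model = IsolationForest(random_state=0).fit(normal)
scores = model.score_samples(X)

# Robust threshold: median and MAD are insensitive to the outliers themselves,
# so the cutoff stays interpretable ("5 MADs below typical").
med = np.median(scores)
mad = np.median(np.abs(scores - med))
flagged = scores < med - 5 * mad
```

A threshold stated in MADs below typical is something operations teams can reason about and tune, unlike an opaque contamination fraction.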

Time series forecasting with statsmodels SARIMAX baselines

For many business forecasting tasks, a carefully tuned statistical baseline is still the right first step. SARIMAX makes seasonality, trend, and external regressors explicit, which is useful when stakeholders want understandable drivers. I use it before reaching for gradient boosting or deep learning forecasters.