Train-test split and stratified cross-validation done properly

Dr. Elena Vasquez Apr 2026

from sklearn.model_selection import StratifiedKFold, train_test_split, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

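# X (feature matrix) and y (binary labels) are assumed to be defined upstream.
# stratify=y keeps the class ratio the same in the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    stratify=y,
    random_state=42,
)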

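# Imputation and scaling live inside the pipeline, so cross-validation refits
# them on each training fold and never sees the validation fold's statistics.
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000)),
])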

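# Cross-validate on the training split only; X_test stays untouched until the end.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(
    pipeline,
    X_train,
    y_train,
    cv=cv,
    scoring=['roc_auc', 'f1', 'precision', 'recall'],  # these scorers assume a binary target
)
# scores is a dict of per-fold arrays keyed 'test_roc_auc', 'test_f1', and so on.
print(scores)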
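
The snippet reserves X_test but stops at cross-validation. One possible last step, sketched here under assumptions, is to fit the pipeline once on the full training split and score it a single time on the held-out rows. The data below is a synthetic stand-in from make_classification, since the original X and y are not shown, and the choice of ROC AUC as the final metric is also an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: a binary target with roughly 10% positives.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000)),
])

# Fit once on the whole training split; the imputer medians and scaler
# statistics are learned from X_train only.
pipeline.fit(X_train, y_train)

# Score the held-out split a single time, using the positive-class probability.
test_auc = roc_auc_score(y_test, pipeline.predict_proba(X_test)[:, 1])
print(f"held-out ROC AUC: {test_auc:.3f}")
```

Scoring the test split only once keeps it an honest estimate; model selection and tuning would stay on the cross-validation scores.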

Evaluation goes wrong when data splitting is treated like boilerplate. I stratify imbalanced targets so every split keeps the class ratio, guard time order when the rows are chronological, and make sure preprocessing lives inside cross-validation so that imputation medians and scaling statistics are fit on each training fold only. This is the difference between a model that looks good in a notebook and one that behaves predictably in production.
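
The snippet above covers stratification but not time order. As a minimal sketch of that case, assuming rows are already sorted oldest to newest, scikit-learn's TimeSeriesSplit produces folds in which every training row precedes every validation row. The array X_time is a hypothetical placeholder, not data from the original snippet.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical data: 12 rows already sorted from oldest to newest.
X_time = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X_time))

for train_idx, val_idx in folds:
    # Training indices always come before validation indices,
    # so the model is never validated on rows older than its training data.
    print("train:", train_idx, "validate:", val_idx)
```

A splitter like this can be passed as cv to cross_validate in place of StratifiedKFold. For the initial holdout, train_test_split with shuffle=False (and no stratify argument) keeps the most recent rows as the test set.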
