from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

# train_texts/train_labels and valid_texts/valid_labels are assumed to be
# defined elsewhere: lists of raw document strings and their class labels.
pipeline = Pipeline([
    # Unigrams and bigrams, vocabulary capped at 30k; min_df=3 drops rare noise terms.
    ('vectorizer', TfidfVectorizer(max_features=30_000, ngram_range=(1, 2), min_df=3)),
    # class_weight='balanced' compensates for skewed label distributions.
    ('model', LogisticRegression(max_iter=2000, class_weight='balanced')),
])
pipeline.fit(train_texts, train_labels)

predictions = pipeline.predict(valid_texts)
print(classification_report(valid_labels, predictions, digits=3))
Before I fine-tune transformers, I almost always try a TF-IDF baseline. It is fast, interpretable, and often surprisingly competitive on moderately sized text classification tasks. If a linear model over sparse features is already good enough, it is usually the correct production choice: cheaper to train, cheaper to serve, and easier to debug.