from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

# train_texts/train_labels and valid_texts/valid_labels are assumed to be
# defined elsewhere: lists of raw document strings and their class labels.
pipeline = Pipeline([
    # Unigrams and bigrams, vocabulary capped at 30k; min_df=3 drops rare noise terms.
    ('vectorizer', TfidfVectorizer(max_features=30_000, ngram_range=(1, 2), min_df=3)),
    # class_weight='balanced' compensates for skewed label distributions.
    ('model', LogisticRegression(max_iter=2000, class_weight='balanced')),
])
pipeline.fit(train_texts, train_labels)

predictions = pipeline.predict(valid_texts)
print(classification_report(valid_labels, predictions, digits=3))
Before I fine-tune transformers, I almost always try a TF-IDF baseline. It is fast, interpretable, and often surprisingly competitive on moderately sized text classification tasks. If a linear model over sparse features is already good enough, it is usually the correct production choice: cheaper to train, cheaper to serve, and easier to debug.