I have a question about using cross-validation for text classification in sklearn. It is problematic to vectorize all the data before cross-validation, because the classifier would have "seen" the vocabulary that occurs in the test data. Weka has a FilteredClassifier to solve this problem. What is the sklearn equivalent? I mean that for each fold, the feature set would be different because the training data are different.
7
votes
I think this question may have a better reception on the Stack Exchange machine learning and statistics site named "Cross Validated."
- waTeim
This question appears to be off-topic because it belongs on stats.stackexchange.com
- Kevin Panko
Clarification: it is not off-topic, because this question is specifically about cross-validation for text classification in sklearn. Numerical data wouldn't have this problem because the feature set is fixed across folds, but in text classification it differs for every fold.
- user3466018
1 Answer
7
votes
The scikit-learn solution to this problem is to cross-validate a Pipeline of estimators, e.g.:
>>> from sklearn.model_selection import cross_val_score  # sklearn.cross_validation in versions < 0.18
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.svm import LinearSVC
>>> clf = Pipeline([('vect', TfidfVectorizer()), ('svm', LinearSVC())])
clf is now a composite estimator that does feature extraction and SVM model fitting. Given a list of documents (i.e. an ordinary Python list of strings) called documents and their labels y, calling
>>> cross_val_score(clf, documents, y)
will do feature extraction in each fold separately, so that each SVM knows only the vocabulary of its (k−1)-fold training set.
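To make this concrete, here is a minimal, self-contained sketch of the approach; the tiny corpus and labels are made up for illustration. Because the TfidfVectorizer sits inside the Pipeline, it is re-fit on each fold's training split, so no test-fold vocabulary leaks into training.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy corpus and labels (0 = cat, 1 = dog), purely illustrative
documents = [
    "the cat sat on the mat",
    "dogs are loyal companions",
    "cats purr when they are happy",
    "the dog chased the ball",
    "my cat sleeps all day",
    "a dog barks at strangers",
]
y = [0, 1, 0, 1, 0, 1]

# Vectorizer + classifier combined into one estimator
clf = Pipeline([("vect", TfidfVectorizer()), ("svm", LinearSVC())])

# cv=3 -> three folds; the vectorizer is fit only on each fold's
# training documents, never on the held-out fold
scores = cross_val_score(clf, documents, y, cv=3)
print(scores)  # one accuracy score per fold
```

Each entry of `scores` is the accuracy on one held-out fold, computed with a vocabulary built exclusively from that fold's training documents.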