scikit learn: learning curve without information leak?

Question

I would like to generate a learning curve for a LinearSVC estimator that uses CountVectorizer to extract the features. The vectorization step also applies some feature selection.

I could do the following:

1. Fit the vectorizer on all data, including selection of the top N features.
2. Use these features to fit the LinearSVC.
3. Use the LinearSVC as the estimator in sklearn.model_selection.learning_curve().

But I think this will result in an information leak: information based on all the data will be used to select features for the smaller training sets used in the learning curve.

Is this correct? Is there a way to use the built-in sklearn.model_selection.learning_curve() with CountVectorizer without leaking information?

Thank you!

glemaitre · Accepted Answer · 2019-12-18T18:18:46

You need to use a Pipeline in conjunction with learning_curve. When fitting, the pipeline calls fit_transform on each transformer using only the training data; when predicting or scoring, it calls only transform on the test data. learning_curve also applies cross-validation, which you can control with the cv parameter.
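As a rough illustration of the fit/transform split, here is a minimal sketch on a tiny made-up corpus. The documents, labels, and the word "unseenword" are invented for the example; the step name "countvectorizer" is the lowercase class name that make_pipeline assigns automatically.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny made-up corpus, for illustration only.
train_docs = ["good movie", "great film", "bad movie", "awful film"]
train_labels = [1, 1, 0, 0]
test_docs = ["good unseenword film"]

pipeline = make_pipeline(CountVectorizer(), LinearSVC())

# fit() runs fit_transform() on the vectorizer with the training docs only,
# so the vocabulary is learned from the training data alone.
pipeline.fit(train_docs, train_labels)
vocabulary = pipeline.named_steps["countvectorizer"].vocabulary_

# predict() runs only transform(); words never seen in training are ignored.
prediction = pipeline.predict(test_docs)

print(sorted(vocabulary))
```

The vocabulary contains only words from the training documents, so nothing about the test document influenced the features.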

Because the transformers are refit on each training subset, no information leaks from the held-out data. Here is an example using the 20 newsgroups dataset, which scikit-learn can fetch for you.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import learning_curve


categories = [
    'alt.atheism',
    'talk.religion.misc',
]
# Uncomment the following to do the analysis on all the categories
#categories = None

data = fetch_20newsgroups(subset='train', categories=categories)

pipeline = make_pipeline(
    CountVectorizer(), TfidfTransformer(), LinearSVC()
)

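# The whole pipeline is refit on each training subset of each CV split
train_sizes, train_scores, test_scores = learning_curve(
    pipeline, data.data, data.target, cv=5
)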
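The question also mentions selecting the top N features. One way to keep that step leak-free is to make it a pipeline step as well. The sketch below assumes SelectKBest with the chi2 score as the selection method, since the question does not say which one is used; k=5, the synthetic corpus, and the explicit train_sizes are also arbitrary choices made so the example runs offline and quickly.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Small synthetic corpus (made up for illustration), with the two classes
# interleaved so that every training subset contains both labels.
base_docs = [
    "good great fine movie", "bad awful poor movie",
    "great excellent film", "awful terrible film",
    "fine good plot", "poor bad plot",
    "excellent great acting", "terrible awful acting",
]
base_labels = [1, 0, 1, 0, 1, 0, 1, 0]
docs = base_docs * 5
labels = np.array(base_labels * 5)

pipeline = make_pipeline(
    CountVectorizer(),
    SelectKBest(chi2, k=5),  # assumed selection step; refit on each training subset
    LinearSVC(),
)

# Fractions of the training part of each CV split to evaluate.
train_sizes, train_scores, test_scores = learning_curve(
    pipeline, docs, labels, cv=5, train_sizes=[0.25, 0.5, 1.0]
)

# One row per training size, one column per CV split.
mean_test_scores = test_scores.mean(axis=1)
print(train_sizes, mean_test_scores)
```

If the top N features are simply the most frequent terms, CountVectorizer(max_features=N) inside the pipeline should give the same guarantee, because the vectorizer is also refit on each training subset.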
