I would like to generate a learning curve for a LinearSVC estimator that uses CountVectorizer to extract the features. The CountVectorizer also applies a feature selection step.
I could do the following:
- fit the vectorizer on all data, including selection of the top N features
- use these features to fit the LinearSVC
- use the LinearSVC as the estimator in sklearn.model_selection.learning_curve()
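The steps above can be sketched roughly as follows. This is only a minimal illustration: it assumes the feature selection is done through CountVectorizer's `max_features` parameter, and the corpus, labels, and N=3 are made-up placeholders rather than my real data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import learning_curve
from sklearn.svm import LinearSVC

# Toy corpus and labels, invented purely for illustration.
texts = ["good movie", "great film", "bad movie", "awful film"] * 10
labels = np.array([1, 1, 0, 0] * 10)

# Step 1: fit the vectorizer on ALL data; max_features keeps the top N terms
# by corpus-wide frequency (N=3 is an arbitrary placeholder).
vectorizer = CountVectorizer(max_features=3)
X = vectorizer.fit_transform(texts)

# Steps 2-3: hand the already-selected features and a LinearSVC to learning_curve.
# The vocabulary above was chosen using every document, including those that
# end up in the held-out folds here.
train_sizes, train_scores, test_scores = learning_curve(
    LinearSVC(), X, labels, cv=5, train_sizes=[0.25, 0.5, 1.0]
)
```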
But I think this will result in an information leak: the features would be selected using all of the data, and then reused for the smaller training subsets in the learning curve.
Is this correct? Is there a way to use the built-in sklearn.model_selection.learning_curve() with CountVectorizer without an information leak?
Thank you!