I am confused about the size of the feature matrix when using cross-validation in sklearn. Here is my code:
import numpy as np
from sklearn import cross_validation
from sklearn.feature_extraction.text import CountVectorizer

'''Cross-Validation'''
skf = cross_validation.StratifiedKFold(data_label, n_folds=10, shuffle=True, random_state=None)

'''For each fold, do the classification'''
for train_index, test_index in skf:
    train_data = np.array(data_content[train_index])
    train_label = np.array(data_label[train_index])
    test_data = np.array(data_content[test_index])
    test_label = np.array(data_label[test_index])

    '''Create feature matrix'''
    cont_vect = CountVectorizer(analyzer='word')
    train_data_matrix = cont_vect.fit_transform(train_data)
    test_data_matrix = cont_vect.transform(test_data)
    # ... the classification
In every loop of the 10-fold cross-validation, what happens if the feature-document matrix (here, bag of words) created from the training dataset is different from the test feature-document matrix? For example, the word 'happy' appears as a feature in the test dataset but not in the training dataset. I'm not sure my code is correct, because here I used:
cont_vect.fit_transform
to create the training feature matrix, and use
cont_vect.transform
to create the test feature matrix. The code works, but I don't know why. What is the difference between fit_transform and transform? I presume the test matrix is created based on the training matrix.
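To see the difference concretely, I tried a tiny toy example (made-up documents, not my real data):

```python
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(analyzer='word')

# fit_transform learns the vocabulary from these documents AND encodes them
train_matrix = vect.fit_transform(["happy dog", "sad dog"])
print(sorted(vect.vocabulary_))   # ['dog', 'happy', 'sad']

# transform reuses the learned vocabulary; words not seen in training
# (like 'cat' here) are simply dropped
test_matrix = vect.transform(["happy cat"])
print(test_matrix.toarray())      # [[0 1 0]] -- only 'happy' is counted

# both matrices have the same number of columns (the training vocabulary)
print(train_matrix.shape[1] == test_matrix.shape[1])  # True
```

So it looks like the test matrix columns are fixed by whatever fit_transform saw, if I understand correctly.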
If that's true, another question: should the feature size be the same in each loop? With 10-fold CV, no matter which part of the original dataset the training set comes from, training + test together make up the same original dataset, so I would expect the size of the feature matrix to be equal in each loop. But when I check the results, the feature sizes are different: similar, but not equal. I have no idea why this happens. Thanks.
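To illustrate, here is a toy version of what I observe (made-up documents, standing in for different folds):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["happy dog", "sad cat", "happy cat", "angry bird"]

# "fold 1": fit the vectorizer on the first three documents only
v1 = CountVectorizer(analyzer='word')
v1.fit(docs[:3])
print(len(v1.vocabulary_))   # 4 distinct words: cat, dog, happy, sad

# "fold 2": fit on the last three documents only
v2 = CountVectorizer(analyzer='word')
v2.fit(docs[1:])
print(len(v2.vocabulary_))   # 5 distinct words: angry, bird, cat, happy, sad
```

The vocabulary sizes differ between the two "folds", just like the feature sizes differ between loops in my real code.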