I am confused about the size of the feature matrix when using cross-validation in sklearn. Here is my code:
import numpy as np
from sklearn import cross_validation
from sklearn.feature_extraction.text import CountVectorizer

'''Cross-Validation'''
skf = cross_validation.StratifiedKFold(data_label, n_folds=10, shuffle=True, random_state=None)

'''For each fold, do the classification'''
for train_index, test_index in skf:
    train_data = np.array(data_content[train_index])
    train_label = np.array(data_label[train_index])
    test_data = np.array(data_content[test_index])
    test_label = np.array(data_label[test_index])

    '''Create feature matrix'''
    cont_vect = CountVectorizer(analyzer='word')
    train_data_matrix = cont_vect.fit_transform(train_data)
    test_data_matrix = cont_vect.transform(test_data)
    # ... the classification
In every loop of the 10-fold cross-validation, what happens if the feature-document matrix (here, bag of words) created from the training dataset is different from the test feature-document matrix? For example, the word 'happy' appears as a feature in the test dataset but not in the training dataset. I'm not sure my code is correct, because here I used:
cont_vect.fit_transform
to create the training feature matrix, and use
cont_vect.transform
to create the test feature matrix. The code works, but I don't know why. What is the difference between fit_transform and transform? I presume the test matrix is created based on the training matrix.
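To see the difference concretely, I tried a tiny toy example (made-up documents, not my real data):

```python
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(analyzer='word')

# fit_transform learns the vocabulary from these documents AND encodes them
train_matrix = vect.fit_transform(["happy dog", "sad dog"])
print(sorted(vect.vocabulary_))   # ['dog', 'happy', 'sad']

# transform reuses the learned vocabulary; words not seen in training
# (like 'cat' here) are simply dropped
test_matrix = vect.transform(["happy cat"])
print(test_matrix.toarray())      # [[0 1 0]] -- only 'happy' is counted

# both matrices have the same number of columns (the training vocabulary)
print(train_matrix.shape[1] == test_matrix.shape[1])  # True
```

So it looks like the test matrix columns are fixed by whatever fit_transform saw, if I understand correctly.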
If that's true, another question: should the feature size be the same in each loop? With 10-fold CV, no matter which part of the original dataset the training set comes from, training + test together make up the same original dataset, so I would expect the size of the feature matrix to be equal in each loop. But when I check the results, the feature sizes are different: similar, but not equal. I have no idea why this happens. Thanks.
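To illustrate, here is a toy version of what I observe (made-up documents, standing in for different folds):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["happy dog", "sad cat", "happy cat", "angry bird"]

# "fold 1": fit the vectorizer on the first three documents only
v1 = CountVectorizer(analyzer='word')
v1.fit(docs[:3])
print(len(v1.vocabulary_))   # 4 distinct words: cat, dog, happy, sad

# "fold 2": fit on the last three documents only
v2 = CountVectorizer(analyzer='word')
v2.fit(docs[1:])
print(len(v2.vocabulary_))   # 5 distinct words: angry, bird, cat, happy, sad
```

The vocabulary sizes differ between the two "folds", just like the feature sizes differ between loops in my real code.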