Understanding cross_val_score in kfold scikit-learn

Question

Reading the docs for k-fold cross-validation (http://scikit-learn.org/stable/modules/cross_validation.html), I'm attempting to understand the training procedure for each of the folds.

Is this correct: when generating cross_val_score, each fold contains a new training and test set, and the classifier clf passed in the code below is trained and evaluated on these sets to measure each fold's performance?

Does this imply that increasing the size of a fold can affect accuracy, depending on the size of the training set, since increasing the number of folds reduces the training data available for each fold?

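From the docs, cross_val_score is generated using:

from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score
iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)
# Fit and score clf on 5 train/test splits; returns one accuracy per split
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
scores
array([ 0.96...,  1.  ...,  0.96...,  0.96...,  1.        ])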

You are right in the sense that it can affect accuracy AND in the sense that each fold will in principle have different training and testing data. However, the data is split into K folds, and for every fold of testing data, the other K-1 folds are used as training data. So there is a rather large overlap in the training data across the folds being tested. - Uvar

binjip · Accepted Answer · 2017-09-25T14:28:27

I don't think the statement "each fold contains a new training and test set" is correct.

By default, when cv is an integer, cross_val_score uses K-fold cross-validation (StratifiedKFold for classifiers such as SVC, KFold otherwise). This works by splitting the data set into K folds of roughly equal size. Say we have 3 folds (fold1, fold2, fold3), then the algorithm works as follows:

1. Use fold1 and fold2 as the training set for the SVM and test performance on fold3.
2. Use fold1 and fold3 as the training set for the SVM and test performance on fold2.
3. Use fold2 and fold3 as the training set for the SVM and test performance on fold1.

So each fold is used for both training and testing.
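The three runs above can be sketched by hand with the KFold splitter. This is only an illustration of the procedure, not what cross_val_score does internally line by line. The shuffle=True and random_state=0 settings are my assumptions, added because the iris rows are sorted by class.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import KFold

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Assumption: shuffle before splitting, since iris is ordered by class and
# an unshuffled 3-fold split would put one whole class in each test fold.
kf = KFold(n_splits=3, shuffle=True, random_state=0)

test_counts = np.zeros(len(X), dtype=int)  # how often each row is tested
scores = []
for run, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    clf = svm.SVC(kernel='linear', C=1)  # fresh, unfitted model for each run
    clf.fit(X[train_idx], y[train_idx])  # train on the other two folds
    scores.append(clf.score(X[test_idx], y[test_idx]))  # test on the held-out fold
    test_counts[test_idx] += 1
    print(f"run {run}: train={len(train_idx)} test={len(test_idx)} "
          f"accuracy={scores[-1]:.3f}")

# Each row lands in the test set exactly once across the three runs.
print(np.all(test_counts == 1))
```

Each row is tested once and used for training in the other two runs, which is the overlap in training data that the comment on the question describes.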

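Now to the second part of your question. With K folds, each run trains on K-1 folds and tests on the remaining one. If you increase the number of rows of data in a fold (that is, use fewer folds), you do reduce the number of training samples for each of the runs (above, that would be runs 1, 2, and 3), but the total number of samples is unchanged. Conversely, increasing the number of folds gives each run more training data and a smaller test set.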
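To make the effect of the fold count concrete, here is a small sketch that prints the per-run training and test sizes for a 150-row data set (the size of iris). The placeholder array and the choice of K values are mine, for illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.zeros((150, 4))  # placeholder data; only the row count matters here

sizes = {}
for k in (3, 5, 10):
    # Look at the first split only; 150 divides evenly by each k,
    # so every split for that k has the same sizes.
    train_idx, test_idx = next(KFold(n_splits=k).split(X))
    sizes[k] = (len(train_idx), len(test_idx))
    print(f"K={k}: train={len(train_idx)} test={len(test_idx)}")
```

With 150 rows this gives 100/50 for K=3, 120/30 for K=5, and 135/15 for K=10: more folds mean more training rows per run and a smaller test fold.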

Generally, selecting the right number of folds is both art and science. For some heuristics on how to choose your number of folds, I would suggest this answer. The bottom line is that accuracy can be slightly affected by your choice of the number of folds. For large data sets, you are relatively safe with a large number of folds; for smaller data sets, you should run the exercise multiple times with new random splits.
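For the small-data case, one way to "run the exercise multiple times with new random splits" is scikit-learn's RepeatedStratifiedKFold splitter. This is a sketch of one option, not the only approach. The values n_splits=5, n_repeats=10 and random_state=0 are assumptions chosen for illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)

# Assumption: 5 folds repeated 10 times, reshuffling the rows on each repeat.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(clf, iris.data, iris.target, cv=cv)

print(len(scores))  # one score per fold per repeat
print(f"mean accuracy {scores.mean():.3f}, std {scores.std():.3f}")
```

The spread of the scores gives a rough sense of how much the estimate depends on the particular split.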
