
Say I have a learning curve for an SVM, produced with sklearn's learning_curve. I'm also doing 5-fold cross-validation, which, as far as I understand, means splitting the training data into 5 pieces, training on four of them, and testing on the last one.

So my question is: since the size of the training set is different for each data point in the learning curve (because we want to see how the model performs with an increasing amount of data), how does the cross-validation work in that case? Does it still split the whole training set into 5 equal pieces? Or does it split the training set of the current point into five smaller pieces and then compute the test score? Is it possible to get a confusion matrix for each data point (i.e. true positives, true negatives, etc.)? I don't see a way to do that yet based on the sklearn learning_curve code.

Does how many folds of cross-validation relate to how many pieces of training set we are splitting in train_sizes = np.linspace(0.1, 1.0, 5)?

train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(
    estimator, X, y,
    cv=cv, n_jobs=n_jobs, scoring=scoring,
    train_sizes=train_sizes,
    return_times=True,  # needed to get fit_times and score_times back
)

Thank you!


1 Answer


No, it does not split the training data into 5 folds again for each point. The 5-fold split is made once, on the whole dataset. Then, for a particular combination of training folds (for example, folds 1, 2, 3 and 4 as training), it picks only the first k data points from those 4 training folds, where k is the training size at that x-tick. The test fold is used as is for testing.

If you look at the code of learning_curve, it becomes clearer:

for train, test in cv_iter:
    for n_train_samples in train_sizes_abs:
        train_test_proportions.append((train[:n_train_samples], test))

train_sizes_abs would be something like [200, 400, ..., 1400] for the plot that you mentioned, and n_train_samples takes each of those values in turn.
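Reproducing that loop by hand also gives you the predictions at each point, which is one way to get a confusion matrix per training size. The following is only a minimal sketch, not what learning_curve does internally beyond the slicing shown above. The synthetic data, the linear SVC, and the tick values in train_sizes_abs are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold
from sklearn.svm import SVC

# Synthetic binary data standing in for your own X, y (assumption).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

cv = KFold(n_splits=5)                 # each split: 160 train / 40 test samples
train_sizes_abs = [40, 80, 120, 160]   # hypothetical x-tick values

matrices = {}  # (fold index, n_train_samples) -> 2x2 confusion matrix
for fold, (train, test) in enumerate(cv.split(X)):
    for n_train_samples in train_sizes_abs:
        subset = train[:n_train_samples]   # same slicing as in the excerpt
        model = SVC(kernel="linear").fit(X[subset], y[subset])
        matrices[fold, n_train_samples] = confusion_matrix(
            y[test], model.predict(X[test]), labels=[0, 1]
        )

# Sum over the folds to get one confusion matrix per training size.
per_size = {
    n: sum(matrices[fold, n] for fold in range(cv.get_n_splits()))
    for n in train_sizes_abs
}
for n, cm in per_size.items():
    print(n, cm.tolist())
```

Note that the test fold is never sliced: only the training indices shrink, so every training size is scored on the same held-out samples within a split.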

Does how many folds of cross-validation relate to how many pieces of training set we are splitting in train_sizes = np.linspace(0.1, 1.0, 5)?

The number of folds is not tied to train_sizes, so we can't assign a number of folds to a particular training size. Each training size is just a subset of the data points from all the training folds of a given split.
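One quick way to convince yourself that the two settings are independent is to look at the shape of the returned scores: one row per training size, one column per fold. This is a small sketch on synthetic data; the dataset, the linear SVC, and the choice of cv=3 are assumptions picked only to make the two numbers differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

# Synthetic binary data standing in for your own X, y (assumption).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

sizes, train_scores, test_scores = learning_curve(
    SVC(kernel="linear"), X, y,
    cv=3,                                    # 3 folds ...
    train_sizes=np.linspace(0.1, 1.0, 5),    # ... but 5 training sizes
)

print(sizes)               # absolute number of training samples per tick
print(train_scores.shape)  # (number of training sizes, number of folds)
print(test_scores.shape)
```

Averaging over axis 1 (the folds) gives the single value per tick that is usually plotted.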