
In the caret documentation, the p argument of trainControl is described as "For leave-group out cross-validation: the training percentage".

Could anyone please explain the difference between the following when defining a 10-fold cross-validation to pass to the train function of the caret package:

(a). control <- trainControl(method = "cv", number = 10, p=.9)

(b). control <- trainControl(method = "cv", number = 10)

As an example, say we have a data set of 10,000 observations. Since both use 10 folds, for (a) my understanding is that each fold would contain 1,000 observations, and in each iteration nine folds totalling 9,000 observations (90%) would be used for training. For (b), the 9 training folds together would contain 7,500 (75%) of the observations (since the default is p = .75), with the remaining fold of 2,500 (25%) observations always used for testing. Is my understanding correct?

Do you call both of these 10-fold cross-validation? Or should the first one be called leave-group-out 10-fold cross-validation?

Welcome to SO. When quoting from somewhere (including documentation), it is good practice to also include the relevant link. Kindly edit & update your question accordingly. – desertnaut

Thanks. I have added the link to the documentation. – bkfha

1 Answer


When using

control <- trainControl(method = "LGOCV", number = 10, p=.9)

you will perform 10 repetitions of leave-group-out validation, where in each repetition 90% (p = 0.9) of the data is sampled at random and used for training while the remaining 10% is used for testing. This is also called Monte Carlo cross-validation; usually more repetitions are performed with MC-CV.
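To see what this resampling looks like concretely, here is a small sketch (assuming the caret package is installed) using createDataPartition, which draws the same kind of independent random splits:

```r
library(caret)

set.seed(42)
y <- factor(rep(c("a", "b"), 5000))  # toy outcome with 10,000 observations

# 10 independent random 90/10 splits, as in LGOCV with number = 10, p = 0.9
splits <- createDataPartition(y, times = 10, p = 0.9)

sapply(splits, length)  # each training sample holds roughly 9,000 observations
# Unlike K-fold CV, the 10 held-out 10% test sets may overlap between repetitions.
```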

When using

control <- trainControl(method = "cv", number = 10)

the data set will be split into 10 parts and 10 resampling iterations will be performed. In each iteration, 9 parts are used for training and the remaining part for testing, until every part has been used exactly once for testing. This is called K-fold cross-validation; here K is 10.
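The disjoint partitioning can be sketched with caret's createFolds (assuming the package is installed), using the asker's example of 10,000 observations:

```r
library(caret)

set.seed(42)
y <- factor(rep(c("a", "b"), 5000))  # toy outcome with 10,000 observations

# Partition into 10 disjoint folds, as in method = "cv" with number = 10
folds <- createFolds(y, k = 10)      # returns the test indices for each fold

sapply(folds, length)  # each fold holds roughly 1,000 observations
# In each iteration the other 9 folds (~9,000 observations, 90%) form the
# training set, and every observation appears in exactly one test fold.
```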

Further reading: https://stats.stackexchange.com/questions/51416/k-fold-vs-monte-carlo-cross-validation

In both cases, around 90% of the data is used for training in each resampling iteration.

EDIT: I have changed the above code

control <- trainControl(method = "cv", number = 10, p=.9)

to

control <- trainControl(method = "LGOCV", number = 10, p=.9)

since if you specify method = "cv" you will actually perform K-fold CV and the p argument will be ignored. To perform leave-group-out cross-validation you must specify method = "LGOCV"; only then is the p argument used to determine the train/test split ratio.
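You can inspect this in the returned control object (a sketch; the stored p is simply carried along and, per the behaviour described above, is only consulted by the LGOCV resampling):

```r
library(caret)

ctrl_cv    <- trainControl(method = "cv",    number = 10, p = 0.9)
ctrl_lgocv <- trainControl(method = "LGOCV", number = 10, p = 0.9)

ctrl_cv$method;    ctrl_cv$p     # p = 0.9 is stored either way...
ctrl_lgocv$method; ctrl_lgocv$p  # ...but only LGOCV's resampling uses it
```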