
I'm reading the caret documentation here.

I've used method cv in the past for cross validation, but in this case I'd like to use a simple split of 90% training and 10% hold out for testing.

I suppose I could do folds = 1 but wondered if there's a prescribed way of doing this within caret?

Within the documentation the parameters available for method within trainControl() are given as:

The resampling method: "boot", "boot632", "cv", "repeatedcv", "LOOCV", "LGOCV" (for repeated training/test splits), "none" (only fits one model to the entire training set), "oob" (only for random forest, bagged trees, bagged earth, bagged flexible discriminant analysis, or conditional tree forest models), "adaptive_cv", "adaptive_boot" or "adaptive_LGOCV"

But I'm not sure what these mean. Maybe one of them would be the one I need?

1 Answer


One solution is to create the train/test splits outside of caret and use the index argument of trainControl to make caret use these data partitions.

This requires a list of vectors of training-set indices; such an object is easily created with the caret::createDataPartition() function.

library(caret)
library(MASS)

set.seed(1234)

# create four 50/50 partitions
parts <- createDataPartition(Boston$medv, times = 4, p = 0.5)

ctrl <- trainControl(method = "repeatedcv",
                     ## The method doesn't matter
                     ## since we are defining the resamples
                     index = parts,
                     savePredictions = TRUE)
res <- train(medv ~ indus + chas, data = Boston, method = "lm",
             trControl = ctrl)

res
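
Alternatively, caret can generate a single train/test split itself: "LGOCV" (leave-group-out cross-validation) with number = 1 produces exactly one training/test partition, and p controls the training proportion. A minimal sketch for the 90/10 split you describe:

```r
library(caret)
library(MASS)

set.seed(1234)

## LGOCV with number = 1 fits the model once on 90% of the data
## and evaluates it on the held-out 10%.
ctrl2 <- trainControl(method = "LGOCV",
                      p = 0.9,        # proportion of data used for training
                      number = 1,     # a single split, not repeated
                      savePredictions = TRUE)

res2 <- train(medv ~ indus + chas, data = Boston, method = "lm",
              trControl = ctrl2)

res2
```

Note that LGOCV's splits are also stratified on the outcome, like createDataPartition's.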

Note that createDataPartition creates splits that are stratified on the outcome variable. I ended up writing my own partition function to create truly random partitions, but that was for teaching purposes. My impression is that, in practice, stratified sampling on the outcome is virtually always preferable to purely random sampling.
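
For reference, a truly random (non-stratified) partition function could look like the sketch below. randomPartition is a made-up helper, not part of caret; it mimics createDataPartition's return shape (a named list of training-index vectors) so it can be passed to trainControl's index argument:

```r
library(MASS)  # for the Boston data used in the example above

## Hypothetical helper: draws training indices uniformly at random,
## ignoring the outcome variable entirely.
randomPartition <- function(n, times = 1, p = 0.5) {
  out <- lapply(seq_len(times), function(i) {
    sort(sample(seq_len(n), size = floor(n * p)))
  })
  names(out) <- paste0("Resample", seq_len(times))
  out
}

set.seed(1234)
parts <- randomPartition(nrow(Boston), times = 4, p = 0.5)
str(parts)  # a named list of 4 integer index vectors
```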