I'm a little confused about how caret scores the test folds in k-fold cross validation.

I'd like to generate a data frame or matrix containing the scored records of the ten test datasets in 10-fold cross validation.

For example, using the iris dataset to train a decision tree model:

install.packages("caret", dependencies=TRUE) 

library(caret)

data(iris)

train_control <- trainControl(method="cv", number=10, savePredictions = TRUE)

model <- train(Species ~ ., data=iris, trControl=train_control, method="rpart")

model$pred

model$pred returns 450 rows of predictions across the ten folds.

This doesn't seem right - shouldn't model$pred produce predictions for the 150 records in the ten test folds (1/10 * 150 = 15 records per test fold)? How are 450 records generated?

1 Answer

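By default, train evaluates three candidate values for the complexity parameter cp of rpart (tuneLength defaults to 3; see ?rpart.control for cp itself). With savePredictions = TRUE, the held-out predictions are kept for every candidate value, so model$pred has 3 * 150 = 450 rows: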

library(caret)
data(iris)
train_control <- trainControl(method="cv", number=10, savePredictions = TRUE) 

model <- train(Species ~ ., 
               data=iris, 
               trControl=train_control, 
               method="rpart")
nrow(model$pred)
# [1] 450
length(unique(model$pred$cp))
# [1] 3
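
To check where the 450 rows come from, the saved predictions can be tabulated by cp value and by original row. This is only a sketch: the names ctrl and fit and the seed are arbitrary choices for illustration, and it assumes the pred data frame has the cp and rowIndex columns that caret normally writes for rpart.

```r
library(caret)
data(iris)
set.seed(1)  # arbitrary seed, only so the fold assignment is repeatable

ctrl <- trainControl(method = "cv", number = 10, savePredictions = TRUE)
fit <- train(Species ~ ., data = iris, trControl = ctrl, method = "rpart")

# Number of held-out predictions stored per candidate cp value
rows_per_cp <- table(fit$pred$cp)
print(rows_per_cp)

# How often each original row of iris appears in fit$pred
# (once per candidate cp value, since each row is held out once per candidate)
appearances <- table(fit$pred$rowIndex)
print(table(appearances))
```

Each cp value should account for 150 rows, and each row of iris should appear once per cp value.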

You can change that, for example, by explicitly specifying cp = 0.05:

model <- train(Species ~ ., 
               data=iris, 
               trControl=train_control, 
               method="rpart", 
               tuneGrid = data.frame(cp = 0.05))
nrow(model$pred)
# [1] 150
length(unique(model$pred$cp))
# [1] 1

or by using tuneLength=1 instead of the default 3:

model <- train(Species ~ ., 
               data=iris, 
               trControl=train_control, 
               method="rpart", 
               tuneLength = 1)
nrow(model$pred)
# [1] 150
length(unique(model$pred$cp))
# [1] 1
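
If the goal is the data frame of scored test records from the question while still letting train tune cp, one option is to keep only the hold-out predictions for the selected cp. The sketch below assumes a caret version in which savePredictions accepts "final"; the names ctrl, fit and scored and the seed are made up for illustration.

```r
library(caret)
data(iris)
set.seed(42)  # arbitrary seed, only so the fold assignment is repeatable

# "final" keeps hold-out predictions only for the cp value that train selects
ctrl <- trainControl(method = "cv", number = 10, savePredictions = "final")
fit <- train(Species ~ ., data = iris, trControl = ctrl, method = "rpart")

# Sort by the original row number and attach the four predictor columns,
# giving one scored record per row of iris
scored <- fit$pred[order(fit$pred$rowIndex), ]
scored <- cbind(iris[scored$rowIndex, 1:4], scored)

nrow(scored)
# [1] 150
head(scored)

# Number of test records in each of the ten folds
table(scored$Resample)
```

With savePredictions = TRUE, as in the code above, filtering model$pred to the rows whose cp equals model$bestTune$cp should give the same 150 records.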