When running 5-fold cross-validation in caret with the random forest method, what are the properties of the random forest built in each fold? For example, with the iris data set:

library(caret)
train_control <- trainControl(method = "cv", number = 5, savePredictions = TRUE)
output <- train(Species ~ ., data = iris, trControl = train_control, method = "rf")
output$results$mtry
[1] 2 3 4

Is it true that, because there are 3 mtry values, 3 different forests are built during cross-validation? How can I inspect the details of the forest built in each fold, such as its mtry?

1 Answer

By default, the caret train function does a grid search for the best mtry. If the length of the grid is not supplied (tuneLength), it searches over 3 candidate values.

These defaults can be seen from:

?trainControl
?train

tuneLength = ifelse(trControl$method == "none", 1, 3)   # default argument of train
search = "grid"                                         # default argument of trainControl

With a grid search (the default) of length 3 (the default), the candidate mtry values are generated by the caret function var_seq, which is called from the rf train method. This function behaves differently depending on the number of features. With fewer than 500 features it chooses mtry as:

floor(seq(2, to = p, length = len))

where p is the number of features and len is the grid length. The iris data has 4 features, so with a len of 3 the candidate mtry values are 2, 3 and 4.
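As a quick check of that arithmetic, here is a minimal base R sketch. The helper name mtry_grid is hypothetical and mirrors only the formula quoted above for fewer than 500 features; it is not caret's own function and ignores any other branches var_seq has.

```r
# Hypothetical helper: reproduces only the quoted rule, not caret::var_seq itself.
mtry_grid <- function(p, len = 3) {
  floor(seq(2, to = p, length.out = len))
}

mtry_grid(4)    # iris has 4 predictors -> 2 3 4
mtry_grid(10)   # evenly spaced between 2 and p -> 2 6 10
```

The first call matches the values reported in output$results$mtry in the question.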

Hence each of these three mtry values is evaluated with 5-fold CV, so 15 rf models are fitted: 5 for each mtry (one per fold). At the end, the best mtry is selected based on the CV results and a final model is built on the whole training data. That is the 16th model.
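To look at what happened in each fold, something along these lines may help. It is a sketch that assumes the caret and randomForest packages are installed and that savePredictions = TRUE was set, as in the question; the seed is arbitrary and the names fits and per_fit are made up for illustration. The fold-level forests themselves are not kept in the returned object, so the per-fold information comes from the saved hold-out predictions.

```r
library(caret)

set.seed(42)  # arbitrary seed, only so the fold assignment is repeatable
train_control <- trainControl(method = "cv", number = 5, savePredictions = TRUE)
output <- train(Species ~ ., data = iris, trControl = train_control, method = "rf")

# output$pred has one row per held-out prediction; the mtry and Resample
# columns identify which (mtry, fold) fit produced it.
fits <- unique(output$pred[, c("mtry", "Resample")])
nrow(fits)                  # number of (mtry, fold) combinations fitted during CV

# Hold-out accuracy of each (mtry, fold) fit.
per_fit <- aggregate(correct ~ mtry + Resample,
                     data = transform(output$pred, correct = pred == obs),
                     FUN = mean)
per_fit

output$bestTune             # the mtry selected from the CV results
output$finalModel$mtry      # mtry of the forest refit on the whole training data
```

output$finalModel is the one forest that is retained, so tree-level details (number of trees, variable importance and so on) can only be read from that object.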