
I have a data frame containing 499 observations and 1412 variables. I split the data frame into training and test sets and ran 5-fold cross-validation on the training set with caret, using the random forest method. My question is: how does cross-validation with the random forest method choose the values of mtry? Looking at the plot, for example, why doesn't the procedure choose 30 as the starting value of mtry?


1 Answer


To answer this, one needs to check the train code for the rf model.

From the linked code, it is clear that when grid search is specified, caret uses the caret::var_seq function to generate the mtry values:

mtry = caret::var_seq(p = ncol(x), 
                      classification = is.factor(y), 
                      len = len)

From the help for the function it can be seen that if the number of predictors is less than 500, a simple sequence of values of length len is generated between 2 and p. For larger numbers of predictors, the sequence is created using log2 steps.

So, for example:

caret::var_seq(p = 1412, 
               classification = TRUE, 
               len = 3)
#output
[1]    2   53 1412
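
To see why a value such as 30 does not appear, the rule described in the help page can be approximated in base R. This is only a sketch under stated assumptions: `mtry_grid_sketch` is a hypothetical helper, not caret's code, and it ignores special cases such as len = 1 or very few predictors. It shows that the candidates are fixed by p and len alone, evenly spaced on a log2 scale when p is 500 or more.

```r
# Hypothetical helper: approximates the default grid rule described above.
# It is not caret's implementation and skips its special cases.
mtry_grid_sketch <- function(p, len = 3) {
  if (p < 500) {
    # fewer than 500 predictors: evenly spaced values between 2 and p
    floor(seq(2, p, length.out = len))
  } else {
    # 500 or more predictors: evenly spaced exponents of 2, from 2^1 up to p
    floor(2^seq(1, log2(p), length.out = len))
  }
}

mtry_grid_sketch(100, len = 3)   # linear spacing: 2, 51, 100

# More candidates for p = 1412; the last value may land on p or p - 1
# because of floating-point rounding before floor().
grid_1412 <- mtry_grid_sketch(1412, len = 5)
print(grid_1412)
```

Raising tuneLength adds more points along the same log2 scale, so a specific value such as 30 only shows up if it happens to fall on that scale.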

If len = 1 is specified, the defaults from the randomForest package are used:

mtry = if (!is.null(y) && !is.factor(y))
       max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x)))
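
As a quick illustration, the rule above can be applied to the dimensions from the question. This sketch assumes all 1412 columns are predictors (if one of them is the response, p would be 1411), and `default_mtry_sketch` is a hypothetical name used only here.

```r
# Hypothetical helper applying the default rule shown above:
# sqrt(p) for classification, p / 3 (at least 1) for regression.
default_mtry_sketch <- function(p, classification) {
  if (classification) floor(sqrt(p)) else max(floor(p / 3), 1)
}

default_mtry_sketch(1412, classification = TRUE)   # 37
default_mtry_sketch(1412, classification = FALSE)  # 470
```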

If random search is specified, caret calculates mtry as:

unique(sample(1:ncol(x), size = len, replace = TRUE))

In other words, for your case:

unique(sample(1:1412 , size = 3, replace = TRUE))
#output
[1] 857 181  64
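
Because the draw is made with replacement and then passed through unique(), the random grid can contain fewer than len values. A minimal base R sketch of that effect follows; the seed and the small example with 4 predictors are arbitrary choices made for illustration.

```r
set.seed(42)  # arbitrary seed, only to make the draw repeatable

p <- 1412
len <- 3
# same expression as above: up to len distinct values between 1 and p
candidates <- unique(sample(1:p, size = len, replace = TRUE))
print(candidates)

# With few predictors, duplicates are likely and the grid shrinks:
small <- unique(sample(1:4, size = 10, replace = TRUE))
length(small)  # at most 4, although 10 values were requested
```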

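Here is an example:

library(caret)
# simulated data: 100 observations, 1000 numeric columns,
# stored as a data frame, which the formula interface expects
z <- as.data.frame(matrix(rnorm(100000), ncol = 1000))
colnames(z) <- paste0("V", 1:1000)
# specify model evaluation: 10-fold CV, repeated once
ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     repeats = 1)
# train with the default grid search and three candidate mtry values
fit_rf <- train(V1 ~ .,
                data = z,
                method = "rf",
                tuneLength = 3,
                trControl = ctrl)
fit_rf$results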
#output
  mtry      RMSE   Rsquared       MAE    RMSESD RsquaredSD     MAESD
1    2 0.8030665 0.11101385 0.5889436 0.2824439 0.09644324 0.1650381
2   44 0.8146023 0.09481331 0.6014367 0.2821711 0.10082099 0.1665926
3  998 0.8420705 0.03190199 0.6375570 0.2503089 0.03205335 0.1550021

These are the same mtry values one would obtain from:

caret::var_seq(p = 999, 
               classification = FALSE, 
               len = 3)

When random search is specified:

ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     repeats = 1,
                     search = "random")

fit_rf <- train(V1 ~.,
                data = z,
                method = "rf",
                tuneLength = 3,
                trControl = ctrl)
fit_rf$results
#output
  mtry      RMSE   Rsquared       MAE    RMSESD RsquaredSD      MAESD
1  350 0.8571330 0.10195986 0.6214896 0.1637944  0.1385415 0.09904165
2  826 0.8644918 0.07775553 0.6286101 0.1725390  0.1264605 0.10587076
3  855 0.8636692 0.07025535 0.6232729 0.1754164  0.1332580 0.10438083

or some other random numbers obtained by:

unique(sample(1:999 , size = 3, replace = TRUE))

To fix mtry at desired values, it is best to provide your own search grid through the tuneGrid argument of train. A tutorial on how to do that and much more can be found here.
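
A minimal sketch of a custom grid, assuming the caret and randomForest packages are installed. The toy data, the candidate values 10, 30 and 50, and ntree = 50 are illustrative choices only, not recommendations.

```r
# Candidate mtry values chosen by hand. The column must be named "mtry",
# the tuning parameter of caret's "rf" method.
rf_grid <- expand.grid(mtry = c(10, 30, 50))

# Synthetic regression data (hypothetical): 100 rows, V1 as the response
# and 59 predictors, so every candidate mtry is at most the predictor count.
set.seed(1)
toy <- as.data.frame(matrix(rnorm(100 * 60), ncol = 60))

# Fit only if both packages are available.
if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("randomForest", quietly = TRUE)) {
  ctrl <- caret::trainControl(method = "cv", number = 5)
  fit <- caret::train(V1 ~ .,
                      data = toy,
                      method = "rf",
                      tuneGrid = rf_grid,   # replaces the default grid
                      trControl = ctrl,
                      ntree = 50)           # passed on to randomForest
  print(fit$results$mtry)
}
```

With tuneGrid supplied, only the listed values are evaluated, so 30 can be included explicitly.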