I have a data frame containing 499 observations and 1412 variables. I split the data frame into training and test sets and fit a Random Forest on the training set using caret with 5-fold cross-validation. My question is: how does cross-validation with the Random Forest method choose the values of mtry? Looking at the plot, for example, why doesn't the procedure choose 30 as the starting value of mtry?
1 Answer
To answer this, one needs to check the train code for the rf model.
From the linked code it is clear that if grid search is specified, caret will use the caret::var_seq function to generate mtry.
mtry = caret::var_seq(p = ncol(x),
classification = is.factor(y),
len = len)
According to the help page for this function, if the number of predictors is less than 500, a simple sequence of len values is generated between 2 and p. For larger numbers of predictors, the sequence is created using log2 steps.
For example:
caret::var_seq(p = 1412,
               classification = TRUE,
               len = 3)
#output
[1] 2 53 1412
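The grid logic described above can be mimicked in a few lines of base R. This is a minimal sketch based only on the behaviour stated in the help page, not caret's actual source; `sketch_var_seq` is a hypothetical name, and edge cases (such as `len = 1` or a very small `p`) are deliberately ignored.

```r
# Hypothetical re-implementation of the documented grid behaviour;
# not caret's code, and edge cases (len = 1, p <= len) are ignored.
sketch_var_seq <- function(p, len = 3) {
  if (p < 500) {
    # fewer than 500 predictors: evenly spaced values between 2 and p
    floor(seq(2, p, length.out = len))
  } else {
    # 500 or more predictors: evenly spaced exponents, i.e. log2 steps
    floor(2^seq(1, log2(p), length.out = len))
  }
}

sketch_var_seq(p = 100, len = 3)   # linear spacing
sketch_var_seq(p = 1412, len = 3)  # log2 spacing
```

With log2 spacing and len = 3, the middle candidate is roughly the geometric mean of 2 and p (about 53 for p = 1412), which is why a value such as 30 does not appear in the default grid.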
If len = 1 is specified, the defaults from the randomForest package are used:
mtry = if (!is.null(y) && !is.factor(y))
max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x)))
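As a quick check, these defaults can be evaluated by hand. The sketch below assumes 1412 predictors, the figure given in the question; the exact count after removing the outcome column is not stated, so treat the number as illustrative.

```r
p <- 1412  # assumed number of predictors, taken from the question

mtry_classification <- floor(sqrt(p))        # used when y is a factor
mtry_regression     <- max(floor(p / 3), 1)  # used when y is numeric

mtry_classification
mtry_regression
```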
If a random search is specified, caret calculates mtry as:
unique(sample(1:ncol(x), size = len, replace = TRUE))
In other words, for your case:
unique(sample(1:1412 , size = 3, replace = TRUE))
#output
[1] 857 181 64
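Because the draw is made with replacement and then passed through unique(), a random search can return fewer than len distinct candidates, and the values from sample() change from run to run unless a seed is set. A small sketch of both effects (the seed value and the tiny range 1:5 are arbitrary choices for illustration):

```r
set.seed(42)  # arbitrary seed, only to make the draw repeatable

# Sampling 10 values from only 5 possibilities must produce duplicates,
# so unique() leaves at most 5 candidates.
candidates <- unique(sample(1:5, size = 10, replace = TRUE))
length(candidates) <= 5

# Re-using the same seed reproduces the same candidates.
set.seed(42)
again <- unique(sample(1:5, size = 10, replace = TRUE))
identical(candidates, again)
```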
Here is an example:
library(caret)
#some data
z <- matrix(rnorm(100000), ncol = 1000)
colnames(z) = paste0("V", 1:1000)
#specify model evaluation
ctrl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 1)
#train
fit_rf <- train(V1 ~.,
data = z,
method = "rf",
tuneLength = 3,
trControl = ctrl)
fit_rf$results
#output
mtry RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 2 0.8030665 0.11101385 0.5889436 0.2824439 0.09644324 0.1650381
2 44 0.8146023 0.09481331 0.6014367 0.2821711 0.10082099 0.1665926
3 998 0.8420705 0.03190199 0.6375570 0.2503089 0.03205335 0.1550021
These are the same mtry values as one would obtain by doing:
caret::var_seq(p = 999,
               classification = FALSE,
               len = 3)
When random search is specified:
ctrl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 1,
search = "random")
fit_rf <- train(V1 ~.,
data = z,
method = "rf",
tuneLength = 3,
trControl = ctrl)
fit_rf$results
#output
mtry RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 350 0.8571330 0.10195986 0.6214896 0.1637944 0.1385415 0.09904165
2 826 0.8644918 0.07775553 0.6286101 0.1725390 0.1264605 0.10587076
3 855 0.8636692 0.07025535 0.6232729 0.1754164 0.1332580 0.10438083
The mtry values could equally be some other random numbers, as obtained by:
unique(sample(1:999 , size = 3, replace = TRUE))
To fix mtry to the desired values, it is best to provide your own search grid. A tutorial on how to do that and much more can be found here.
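A minimal sketch of such a custom grid, using simulated data small enough to run quickly. The data, the candidate values 5, 10 and 30, and ntree = 50 are arbitrary choices for illustration; for the rf method the grid needs a single column named mtry, and the randomForest package is assumed to be installed.

```r
library(caret)

set.seed(1)
# simulated regression data: 100 rows, outcome y plus 50 predictors
dat <- as.data.frame(matrix(rnorm(100 * 51), ncol = 51))
colnames(dat) <- c("y", paste0("x", 1:50))

# candidate mtry values chosen by hand instead of caret's default grid
grid <- expand.grid(mtry = c(5, 10, 30))

ctrl <- trainControl(method = "cv", number = 5)

fit <- train(y ~ .,
             data = dat,
             method = "rf",
             tuneGrid = grid,
             trControl = ctrl,
             ntree = 50)  # passed on to randomForest to keep the run short

fit$results$mtry
```

When tuneGrid is supplied, only the listed values are evaluated, so tuneLength and the default grid no longer determine which mtry values are tried.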