
I've been trying to tune a random forest model using the tuneRF tool included in the randomForest package, and I'm also using the caret package to tune the model. The issue is that I'm tuning for mtry and I'm getting different results from each approach. The question is: how do I know which approach is best, and based on what? It's not clear to me whether I should expect similar or different results.

tuneRF: with this approach the best mtry I get is 3

library(randomForest)

# predictors in columns 1-11, response in column 12
t <- tuneRF(train[, -12], train[, 12],
            stepFactor = 0.5,
            plot = TRUE,
            ntreeTry = 100,
            trace = TRUE,
            improve = 0.05)

caret: with this approach I always get that the best mtry is all the variables, in this case 6

library(caret)

control <- trainControl(method = "cv", number = 5)
tunegrid <- expand.grid(.mtry = c(2:6))
set.seed(2)
custom <- train(CRTOT_03 ~ ., data = train, method = "rf", metric = "RMSE",
                tuneGrid = tunegrid, ntree = 100, trControl = control)

1 Answer


There are a few differences. For each value of mtry, tuneRF fits one model on the whole dataset, and you get the OOB error from each of those fits; tuneRF then takes the mtry with the lowest OOB error. So for each value of mtry you have one score (or RMSE value), and it will change from run to run.
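For example, here is a minimal sketch (using the BostonHousing data also used in the comparison below) of what a single tuneRF run returns: one row per mtry tried, scored by the OOB error of one forest fit on the full data.

library(randomForest)
library(mlbench)
data(BostonHousing)

set.seed(1)
tr <- tuneRF(BostonHousing[, -14], BostonHousing[, 14],
             mtryStart = 2, stepFactor = 1.5, ntreeTry = 100,
             improve = 0.01, trace = FALSE, plot = FALSE)
tr                                   # columns: mtry, OOBError
tr[which.min(tr[, "OOBError"]), ]    # the mtry tuneRF would pick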

In caret, you actually do cross-validation, so the test data from each fold is never used to fit the model. Although in principle the OOB estimate should behave similarly to cross-validation, you should be aware of the differences.
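As a side note (a sketch of an alternative, not something from the question), caret can also score a random forest on its OOB error directly via trainControl(method = "oob"), which is the closest analogue to what tuneRF reports:

library(caret)
library(randomForest)
library(mlbench)
data(BostonHousing)

# OOB-based tuning in caret: one forest per mtry, scored by its OOB error
oob_res <- train(medv ~ ., data = BostonHousing, method = "rf",
                 metric = "RMSE", tuneGrid = expand.grid(.mtry = 2:7),
                 ntree = 100, trControl = trainControl(method = "oob"))
oob_res$results   # one OOB RMSE per mtry, comparable to a single tuneRF run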

To get a better picture of the error, we can run tuneRF a few times and compare it against cross-validation in caret:

library(randomForest)
library(mlbench)
data(BostonHousing)
train <- BostonHousing

# run tuneRF 10 times to see how much the OOB estimate varies for each mtry
tuneRF_res = lapply(1:10, function(i){
  tr = tuneRF(train[, -14], train[, 14], mtryStart = 2, stepFactor = 0.9,
              ntreeTry = 100, trace = TRUE, improve = 1e-5)
  tr = data.frame(tr)
  tr$RMSE = sqrt(tr[, 2])   # the OOB error is an MSE, so take the square root
  tr
})

tuneRF_res = do.call(rbind,tuneRF_res)

library(caret)

# 10-fold CV in caret, keeping the per-fold resample results for plotting
control <- trainControl(method = "cv", number = 10, returnResamp = "all")
tunegrid <- expand.grid(.mtry = c(2:7))
caret_res <- train(medv ~ ., data = train, method = "rf", metric = "RMSE",
                   tuneGrid = tunegrid, ntree = 100, trControl = control)

library(ggplot2)

# combine the per-run tuneRF scores and the per-fold caret scores
df = rbind(
  data.frame(tuneRF_res[, c("mtry", "RMSE")], test = "tuneRF"),
  data.frame(caret_res$resample[, c("mtry", "RMSE")], test = "caret")
)
df = df[df$mtry != 1, ]

# mean RMSE with standard-error bars for each mtry, one panel per method
ggplot(df, aes(x = mtry, y = RMSE, col = test)) +
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.2) +
  stat_summary(fun = mean, geom = "line") +
  facet_wrap(~test)

[Plot: mean RMSE with standard-error bars against mtry, one panel for tuneRF and one for caret]

You can see the trend is more or less similar. My suggestion would be to use tuneRF to quickly check the range of mtry values to train over, and then use caret with cross-validation to evaluate them properly, as in the sketch below.
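A sketch of that workflow (the grid built around the tuneRF pick is only an illustration, not a fixed rule):

library(randomForest)
library(caret)
library(mlbench)
data(BostonHousing)

# 1) quick OOB screen with tuneRF
set.seed(1)
screen <- tuneRF(BostonHousing[, -14], BostonHousing[, 14],
                 mtryStart = 2, stepFactor = 1.5, ntreeTry = 100,
                 improve = 0.01, trace = FALSE, plot = FALSE)
best_mtry <- screen[which.min(screen[, "OOBError"]), "mtry"]

# 2) cross-validated confirmation over a small grid around that value
grid <- expand.grid(.mtry = pmax(1, (best_mtry - 1):(best_mtry + 1)))
ctrl <- trainControl(method = "cv", number = 10)
fit <- train(medv ~ ., data = BostonHousing, method = "rf",
             metric = "RMSE", tuneGrid = grid, ntree = 100, trControl = ctrl)
fit$bestTune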