1
votes

Using the defaults of train in the caret package, I am trying to train a random forest model on the dataset xtr2 (dim(xtr2): 765 9408). The problem is that fitting takes unbelievably long (more than one day for a single training run). As far as I know, train by default uses bootstrap sampling (25 repetitions) and three candidate values of mtry, so why should it take so long? Please note that I need to train the rf three times in each run (because I need to average the results of different random forest models on the same data), which takes about three days, and I need to run the code for 10 different samples, so it would take me 30 days to get the results.

My question is: how can I make it faster?

  1. Can changing the defaults of train reduce the running time? For example, using CV for training?

  2. Can parallel processing with the caret package help? If so, how can it be done?

  3. Can tuneRF from the randomForest package make any difference to the running time?

This is the code:

rffit <- train(xtr2, ytr2, method = "rf", ntree = 500)
rf.mdl <- randomForest(x = xtr2, y = as.factor(ytr2), ntree = 500,
                       keep.forest = TRUE, importance = TRUE, oob.prox = FALSE,
                       mtry = rffit$bestTune$mtry)

Thank you,

Can you share your sample dataset? – Sandipan Dey

dim(xtr2): 765 9408 – what does it mean? – user31264

What takes 24 hours: train or randomForest? What is the value of rffit$bestTune$mtry? Did you try to call randomForest or train with the same parameters on small samples of the data (say 50 elements) and see what is going on? Did you try, on these small samples, to play with the parameters keep.forest, importance, oob.prox, mtry? – user31264

@sandipan: yes, I am looking for somewhere I can share the data. – user6845158

@user31264: train is the bottleneck, and rffit$bestTune$mtry is 9407... – user6845158

3 Answers

2
votes

My thoughts on your questions:

  1. Yes! But don't forget you also have control over the search grid caret uses for the tuning parameters; in this case, mtry. I'm not sure what the default search grid is for mtry, but try the following:

    ctrl <- trainControl("cv", number = 5, verboseIter = TRUE)

    set.seed(101) # for reproducibility

    rffit <- train(xtr2, ytr2, method = "rf", trControl = ctrl, tuneLength = 5)

  2. Yes! See the caret website: http://topepo.github.io/caret/parallel-processing.html

  3. Yes and no! tuneRF simply uses the OOB error to find an optimal value of mtry (the only tuning parameter in randomForest). Using cross-validation tends to work better and produce a more honest estimate of model performance. tuneRF can take a long time, but should still be quicker than k-fold cross-validation.
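For point 2, here is a minimal sketch of parallel training combined with a restricted mtry grid, assuming the doParallel package is installed; the number of workers and the grid values are illustrative, not recommendations:

```r
library(caret)
library(doParallel)

# Register a parallel backend; caret then fits the resampled
# models across the workers automatically.
cl <- makePSOCKcluster(4)  # worker count is illustrative
registerDoParallel(cl)

# Keep the mtry grid small instead of letting the search reach
# values near ncol(xtr2) (e.g. mtry = 9407, which is very slow).
ctrl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
grid <- expand.grid(mtry = c(10, 50, 100))  # illustrative values

rffit <- train(xtr2, ytr2, method = "rf",
               ntree = 100, trControl = ctrl, tuneGrid = grid)

stopCluster(cl)
```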
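For point 3, a sketch of how tuneRF could be used, assuming ytr2 holds the class labels; the ntreeTry, stepFactor, and improve values are illustrative:

```r
library(randomForest)

# tuneRF grows small forests at increasing/decreasing mtry values
# and keeps the setting with the lowest OOB error.
tuned <- tuneRF(x = xtr2, y = as.factor(ytr2),
                ntreeTry = 100,   # small forests keep the search cheap
                stepFactor = 2,   # multiply/divide mtry by this each step
                improve = 0.01,   # stop when OOB error improves by < 1%
                trace = TRUE)

# The row with the smallest OOB error gives the chosen mtry.
best.mtry <- tuned[which.min(tuned[, "OOBError"]), "mtry"]
```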

Overall, the online manual for caret is quite good: http://topepo.github.io/caret/index.html.

Good luck!

3
votes

You use train only to determine mtry. I would skip the train step and stay with the default mtry:

rf.mdl <- randomForest(x = xtr2, y = as.factor(ytr2), ntree = 500,
                       keep.forest = TRUE, importance = TRUE, oob.prox = FALSE)

I strongly doubt that 3 different runs are a good idea.

If you do k-fold cross-validation (I am not sure it should be done at all, as validation is already ingrained in the random forest via the OOB error), 10 folds is too many if you are short on time; 5 folds would be enough.

Finally, the running time of randomForest is proportional to ntree. Set ntree = 100, and your program will run 5 times faster.
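To check this near-linear scaling in ntree before committing to a long run, one can time two small fits on a subsample; a sketch, where the subsample size is illustrative:

```r
library(randomForest)

# Time two forests that differ only in ntree; the elapsed times
# should scale roughly linearly with the number of trees.
idx <- sample(nrow(xtr2), 200)  # illustrative subsample for timing
system.time(randomForest(x = xtr2[idx, ], y = as.factor(ytr2[idx]), ntree = 100))
system.time(randomForest(x = xtr2[idx, ], y = as.factor(ytr2[idx]), ntree = 500))
```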

0
votes

I would also just add that if the main issue is speed, there are several other random forest implementations available through caret, and many of them are much faster than the original randomForest, which is notoriously slow. I've found ranger to be a nice alternative that suited my very simple needs.
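A sketch of swapping in ranger via caret, assuming the ranger package is installed; the tuning grid values are illustrative (caret's "ranger" method tunes mtry, splitrule, and min.node.size):

```r
library(caret)
library(ranger)

ctrl <- trainControl(method = "cv", number = 5)
grid <- expand.grid(mtry = c(10, 50, 100),  # illustrative values
                    splitrule = "gini",
                    min.node.size = 1)

# ranger is multithreaded by default, so this is typically much
# faster than method = "rf" on wide data like xtr2.
rffit <- train(xtr2, ytr2, method = "ranger",
               num.trees = 100, trControl = ctrl, tuneGrid = grid)
```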

Here is a nice summary of the random forest packages in R. Many of these are available in caret already.

Also for consideration, here's an interesting study of the performance of ranger vs rborist, where you can see how performance is affected by the tradeoff between sample size and features.