1
votes

Using the defaults of train in the caret package, I am trying to train a random forest model on the dataset xtr2 (dim(xtr2): 765 9408). The problem is that fitting takes unbelievably long (more than one day for a single training run). As far as I know, train by default uses bootstrap sampling (25 repetitions) and three candidate values of mtry, so why should it take so long? Please note that I need to train the rf three times in each run (because I need to average the results of different random forest models on the same data), which takes about three days, and I need to run the code for 10 different samples, so it would take me 30 days to get the results.

My question is: how can I make it faster?

  1. Can changing the defaults of train reduce the running time? For example, using CV for training?

  2. Can parallel processing with the caret package help? If so, how can it be done?

  3. Can tuneRF from the randomForest package make any difference to the running time?

This is the code:

rffit <- train(xtr2, ytr2, method = "rf", ntree = 500)
rf.mdl <- randomForest(x = xtr2, y = as.factor(ytr2), ntree = 500,
                       keep.forest = TRUE, importance = TRUE, oob.prox = FALSE,
                       mtry = rffit$bestTune$mtry)

Thank you,

Can you share your sample dataset? – Sandipan Dey

dim(xtr2): 765 9408 – what does it mean? – user31264

What takes 24 hours: train or randomForest? What is the value of rffit$bestTune$mtry? Did you try to call randomForest or train with the same parameters on small samples of the data (say 50 elements) and see what is going on? Did you try, on these small samples, to play with the parameters keep.forest, importance, oob.prox, mtry? – user31264

@sandipan: yes, I am looking for somewhere I can share the data. – user6845158

@user31264: train is the bottleneck, and rffit$bestTune$mtry is 9407... – user6845158

3 Answers

2
votes

My thoughts on your questions:

  1. Yes! But don't forget you also have control over the search grid caret uses for the tuning parameters; in this case, mtry. I'm not sure what the default search grid is for mtry, but try the following:

    ctrl <- trainControl("cv", number = 5, verboseIter = TRUE)

    set.seed(101) # for reproducibility

    rffit <- train(xtr2, ytr2, method = "rf", trControl = ctrl, tuneLength = 5)

  2. Yes! See the caret website: http://topepo.github.io/caret/parallel-processing.html

  3. Yes and no! tuneRF simply uses the OOB error to find an optimal value of mtry (the only tuning parameter in randomForest). Using cross-validation tends to work better and produce a more honest estimate of model performance. tuneRF can take a long time, but should still be quicker than k-fold cross-validation.
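For point 2, here is a minimal sketch of parallel training combined with a restricted mtry grid, assuming the doParallel package is installed; the number of workers and the grid values are illustrative, not recommendations:

```r
library(caret)
library(doParallel)

# Register a parallel backend; caret then fits the resampled
# models across the workers automatically.
cl <- makePSOCKcluster(4)  # worker count is illustrative
registerDoParallel(cl)

# Keep the mtry grid small instead of letting the search reach
# values near ncol(xtr2) (e.g. mtry = 9407, which is very slow).
ctrl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
grid <- expand.grid(mtry = c(10, 50, 100))  # illustrative values

rffit <- train(xtr2, ytr2, method = "rf",
               ntree = 100, trControl = ctrl, tuneGrid = grid)

stopCluster(cl)
```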
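For point 3, a sketch of how tuneRF could be used, assuming ytr2 holds the class labels; the ntreeTry, stepFactor, and improve values are illustrative:

```r
library(randomForest)

# tuneRF grows small forests at increasing/decreasing mtry values
# and keeps the setting with the lowest OOB error.
tuned <- tuneRF(x = xtr2, y = as.factor(ytr2),
                ntreeTry = 100,   # small forests keep the search cheap
                stepFactor = 2,   # multiply/divide mtry by this each step
                improve = 0.01,   # stop when OOB error improves by < 1%
                trace = TRUE)

# The row with the smallest OOB error gives the chosen mtry.
best.mtry <- tuned[which.min(tuned[, "OOBError"]), "mtry"]
```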

Overall, the online manual for caret is quite good: http://topepo.github.io/caret/index.html.

Good luck!

3
votes

You use train only to determine mtry. I would skip the train step and stay with the default mtry:

rf.mdl <- randomForest(x = xtr2, y = as.factor(ytr2), ntree = 500,
                       keep.forest = TRUE, importance = TRUE, oob.prox = FALSE)

I strongly doubt that 3 different runs are a good idea.

If you do k-fold cross-validation (I am not sure it should be done at all, as validation is already ingrained in the random forest via the OOB error), 10 folds is too many if you are short on time; 5 folds would be enough.

Finally, the running time of randomForest is proportional to ntree. Set ntree = 100, and your program will run 5 times faster.
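To check this near-linear scaling in ntree before committing to a long run, one can time two small fits on a subsample; a sketch, where the subsample size is illustrative:

```r
library(randomForest)

# Time two forests that differ only in ntree; the elapsed times
# should scale roughly linearly with the number of trees.
idx <- sample(nrow(xtr2), 200)  # illustrative subsample for timing
system.time(randomForest(x = xtr2[idx, ], y = as.factor(ytr2[idx]), ntree = 100))
system.time(randomForest(x = xtr2[idx, ], y = as.factor(ytr2[idx]), ntree = 500))
```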

0
votes

I would also just add that if the main issue is speed, there are several other random forest implementations available through caret, and many of them are much faster than the original randomForest, which is notoriously slow. I've found ranger to be a nice alternative that suited my very simple needs.
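A sketch of swapping in ranger via caret, assuming the ranger package is installed; the tuning grid values are illustrative (caret's "ranger" method tunes mtry, splitrule, and min.node.size):

```r
library(caret)
library(ranger)

ctrl <- trainControl(method = "cv", number = 5)
grid <- expand.grid(mtry = c(10, 50, 100),  # illustrative values
                    splitrule = "gini",
                    min.node.size = 1)

# ranger is multithreaded by default, so this is typically much
# faster than method = "rf" on wide data like xtr2.
rffit <- train(xtr2, ytr2, method = "ranger",
               num.trees = 100, trControl = ctrl, tuneGrid = grid)
```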

Here is a nice summary of the random forest packages in R. Many of these are available in caret already.

Also for consideration, here's an interesting study of the performance of ranger vs rborist, where you can see how performance is affected by the tradeoff between sample size and features.