
I would like to make use of the 20 CPU cores I have at hand to train random forests in R. My usual code using randomForest package would be this:

library(randomForest)

rf <- randomForest(Pred ~ ., data = train, ntree = 100, importance = TRUE)
rf

So I train a forest of 100 trees on a factor Pred with 11 levels and a data frame train with 74 numeric features and ~84k observations.

The idea was to speed this up by using caret with my code (derived from this example):

library(caret)
library(doParallel)

cluster <- makeCluster(19)
registerDoParallel(cluster)
trainctrl <- trainControl(method = "none", number = 1, allowParallel = TRUE)
fit <- train(Pred ~ ., data = train, method = "parRF", trControl = trainctrl, ntree = 100)
stopCluster(cluster)
registerDoSEQ()
fit

I replaced method="cv" from the example with method="none", since I want to train on the whole training set (see the documentation). However, I do not get an accuracy from fit; fit$results is empty. If I instead set method="oob", an optimization of mtry is performed, which also gives me accuracies.

Is there a way to simply run the first code snippet in parallel using caret without any hyperparameter optimizations?
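One way to sketch this (assuming the train data frame and Pred factor from above; the mtry value is just randomForest's classification default, floor(sqrt(p))): keep trainControl(method = "oob") so caret still reports an accuracy, but pin mtry with a one-row tuneGrid so no hyperparameter search takes place.

```r
library(caret)
library(doParallel)

cluster <- makeCluster(19)
registerDoParallel(cluster)

# OOB "resampling" costs nothing extra for a random forest and fills fit$results
trainctrl <- trainControl(method = "oob", allowParallel = TRUE)
fit <- train(Pred ~ ., data = train,
             method = "parRF",
             trControl = trainctrl,
             tuneGrid = data.frame(mtry = floor(sqrt(74))),  # single row: no tuning
             ntree = 100)

stopCluster(cluster)
registerDoSEQ()

fit$results  # a single row with the OOB accuracy for the fixed mtry
```

With method = "none", by contrast, caret fits exactly one model and leaves fit$results empty by design, which matches the behavior described above.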

I would advise using the ranger package. It runs in parallel out of the box: method = "ranger" in caret. – phiver
Thanks for the tip, that looks promising at first glance. I will look into it. – Archer
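A minimal sketch of the ranger suggestion from the comment above, assuming the same train data frame and Pred outcome; the thread count is a placeholder:

```r
library(ranger)

# ranger parallelizes tree growth natively, no cluster setup needed
rf <- ranger(Pred ~ ., data = train,
             num.trees = 100,
             importance = "impurity",
             num.threads = 20)

rf$prediction.error  # OOB error; 1 - this value is the OOB accuracy
```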

1 Answer


This is an old question, but you can try the doMC package (note that it will likely not work on Windows, since it relies on forking).

Sample Code:

library(randomForest)
library(caret)
library(e1071)
library(doMC)

# Register the number of parallel workers you want (fork-based, so not Windows)
registerDoMC(8)

# Define the resampling scheme: 10-fold cross-validation with a grid search
trControl <- trainControl(method = "cv",
                          number = 10,
                          search = "grid")

# Parameters for the grid search over mtry
tuneGrid <- expand.grid(.mtry = c(2:5))

# Train the random forest; TrainSet holds the predictors, yTrain the outcome factor
rf_mtry <- train(TrainSet, yTrain,
                 method = "rf",
                 metric = "Accuracy",
                 tuneGrid = tuneGrid,
                 trControl = trControl,
                 importance = TRUE,
                 ntree = 300)

print(rf_mtry)

You can refer to this post as well.
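As a usage sketch (assuming the rf_mtry object from the code above and a hypothetical data frame newData with the same predictor columns), the tuned model can be inspected and used like this:

```r
rf_mtry$bestTune                      # the mtry value that won the grid search
rf_mtry$results                       # cross-validated accuracy for each mtry tried
predict(rf_mtry, newdata = newData)   # class predictions from the final model
```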