I'm trying to run recursive feature elimination for a random forest on a data frame containing 27 predictor variables, each with 3653 values, so there are 98631 values in total in the predictor data frame. I'm using the rfe function from the caret package.

require(caret)
require(randomForest)

subsets <- c(1:5, 10, 15, 20, 25)

set.seed(10)

ctrl <- rfeControl(functions = rfFuncs,
                   method = "repeatedcv",
                   repeats = 5,
                   verbose = FALSE,
                   allowParallel=TRUE)

rfProfile <- rfe(predictors, 
                 y,
                 sizes = subsets,
                 rfeControl = ctrl)

I'm using allowParallel=TRUE in rfeControl, hoping that it will run the process in parallel on my Windows machine. But I'm not sure that it does, since I see no decrease in run time after setting allowParallel=TRUE. The process takes a very long time, and I've had to interrupt the kernel after 1-2 hours each time.

How do I know if caret is running the RFE in parallel? Do I need to install any other parallelization packages for caret to run this process in parallel?

Any help/suggestions will be much appreciated! I'm new to the machine learning world, so it's taking me a while to figure things out.

Not a caret user, but I guess the implementation uses forking, which is not supported on Windows. For a data set of your size, this should take a few minutes unless it is also coupled with parameter tuning. A stand-alone function using foreach and doParallel (supported on Windows) could be written in ~25 lines. - Soren Havelund Welling
I must say, without any intent to contribute a meaningful response, that your question title was very intriguing. (And I have a nagging fear that pkg:caret is a disguised multiple comparisons engine.) - IRTFM
If you're on Windows, you can always pull up Task Manager and see if all of your processors are actually working. I don't think this is a robust solution, but it at least gives you an idea. - Alex W
Thank you, @SorenH.Welling. I'm an R newbie, so the task of writing parallel code seems slightly daunting. But I will give it a go sometime soon! - small_world
@BondedDust Again, not adding anything to the response, but might I ask what a multiple comparisons engine is? I tried searching for the meaning, but couldn't find a reliable description. - small_world

1 Answer

Try installing and registering the doParallel package prior to running rfe. This seemed to work on my Windows machine.

Here's a lengthy example pulled from the caret documentation, with timings before and after registering doParallel:

subsetSizes <- c(2, 4, 6, 8)
set.seed(123)
seeds <- vector(mode = "list", length = 51)
for(i in 1:50) seeds[[i]] <- sample.int(1000, length(subsetSizes) + 1)
seeds[[51]] <- sample.int(1000, 1)

data(BloodBrain)

Run without parallel processing

set.seed(1)
system.time(rfMod <- rfe(bbbDescr, logBBB,
         sizes = subsetSizes,
         rfeControl = rfeControl(functions = rfFuncs, 
                                 seeds = seeds,
                                 number = 50)))

   user  system elapsed 
 113.32    0.44  114.43 

Register the parallel backend

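library(doParallel)
# Start one PSOCK worker process per detected core (PSOCK clusters work on Windows)
cl <- makeCluster(detectCores(), type = "PSOCK")
# Register the cluster as the foreach backend
registerDoParallel(cl)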

Run with parallel processing

set.seed(1)
system.time(rfMod <- rfe(bbbDescr, logBBB,
         sizes = subsetSizes,
         rfeControl = rfeControl(functions = rfFuncs, 
                                 seeds = seeds,
                                 number = 50)))

   user  system elapsed 
   1.57    0.01   56.27
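
As for how to tell whether a backend is actually registered: the foreach package (attached along with doParallel) has helpers that report this. The following is only a minimal sketch, not part of the caret documentation example. It assumes doParallel is installed, and it uses 2 workers instead of detectCores() just to keep the check light.

```r
library(doParallel)  # also attaches foreach and parallel

# In a fresh session no backend is registered, so foreach runs sequentially.
registered_before <- getDoParRegistered()
print(registered_before)

# Assumption: 2 workers for a quick check; the example above uses detectCores().
cl <- makeCluster(2, type = "PSOCK")
registerDoParallel(cl)

# Ask foreach which backend is registered and how many workers it has.
backend_name <- getDoParName()
n_workers <- getDoParWorkers()
print(backend_name)
print(n_workers)

# Each PSOCK worker is a separate R process, so the process IDs returned
# here should differ from the ID of the main session.
worker_pids <- foreach(i = 1:4, .combine = c) %dopar% Sys.getpid()
print(unique(worker_pids))
print(Sys.getpid())

# Release the workers when finished and fall back to sequential execution.
stopCluster(cl)
registerDoSEQ()
```

If getDoParWorkers() reports more than one worker before you call rfe, and allowParallel is TRUE in rfeControl, the resampling loop should be dispatched to those workers. The extra R processes should also show up in Task Manager while rfe is running.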