12
votes

I want to parallelize the model fitting process for xgboost while using caret. From what I have seen in xgboost's documentation, the nthread parameter controls the number of threads used while fitting a model, that is, the trees are built in parallel. Caret's train function, on the other hand, parallelizes at the resampling level, for example by running one process for each iteration of a k-fold CV. Is this understanding correct? If so, is it better to:

  1. Register the number of cores (for example, with the doMC package and the registerDoMC function), set nthread=1 via caret's train function so it passes that parameter to xgboost, set allowParallel=TRUE in trainControl, and let caret handle the parallelization for the cross-validation; or
  2. Disable caret parallelization (allowParallel=FALSE and no parallel back-end registration) and set nthread to the number of physical cores, so the parallelization is contained exclusively within xgboost.

Or is there no "better" way to perform the parallelization?
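For concreteness, the two options could look roughly like the sketch below. The simulated data set, the core count of 4, the 5-fold CV, and the helper names fit_option1 and fit_option2 are illustrative assumptions, not part of my real setup.

```r
library(caret)
library(xgboost)
library(doMC)

set.seed(1)
dat <- twoClassSim(200)  # small simulated data set (assumption, for illustration)
n_cores <- 4             # assumed number of physical cores; adjust as needed

# Option 1: caret parallelizes the resampling loop; xgboost is pinned to
# one thread per worker via nthread = 1, which train() passes on to xgboost.
ctrl_caret <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
fit_option1 <- function() {
  registerDoMC(cores = n_cores)
  train(Class ~ ., data = dat, method = "xgbTree", tuneLength = 2,
        trControl = ctrl_caret, nthread = 1)
}

# Option 2: caret runs the resampling loop sequentially (allowParallel = FALSE),
# and xgboost uses all cores inside each individual model fit.
ctrl_xgb <- trainControl(method = "cv", number = 5, allowParallel = FALSE)
fit_option2 <- function() {
  train(Class ~ ., data = dat, method = "xgbTree", tuneLength = 2,
        trControl = ctrl_xgb, nthread = n_cores)
}
```

Wrapping each call in system.time(), e.g. system.time(fit_option1()), gives a direct runtime comparison of the two.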

Edit: I ran the code suggested by @topepo, with tuneLength = 10 and search="random", and specifying nthread=1 on the last line (otherwise, as I understand it, xgboost would use multithreading inside each worker). These are the results I got:

xgb_par[3]
elapsed  
283.691 
just_seq[3]
elapsed 
276.704 
mc_par[3]
elapsed 
89.074 
just_seq[3]/mc_par[3]
elapsed 
3.106451 
just_seq[3]/xgb_par[3]
elapsed 
0.9753711 
xgb_par[3]/mc_par[3]
elapsed 
3.184891

In the end, it turned out that, both for my data and for this test case, letting caret handle the parallelization was the better choice in terms of runtime.

1
Setting aside the fact that cross-validation is not "model fitting", there's no reason these options must be mutually exclusive. Regardless, the question is opinion-based and I'm voting to close. You haven't defined "better", but I assume you mean less run-time... You can always profile your code. I suggest library(microbenchmark) for that. (Alex W)
Maybe there's a misunderstanding in terminology. Of course, the final goal of cross-validation is model validation, but what I meant by "model fitting" is that in each iteration, you do have to fit a model on (k-1) folds. The reason for this question is that I do not know whether, by construction, there's a theoretically better way to do the parallelization (e.g. there could be more overhead in spawning threads within every iteration than in parallelizing the resampling loop), and I was wondering if someone more experienced could advise on this. But it is true that this might be case-dependent. (drgxfs)

1 Answer

10
votes

It is not simple to predict what the best strategy would be. My (biased) thought is that you should parallelize the process that takes the longest. Here, that would be the resampling loop, since each open thread/worker would invoke the model many times. The opposite approach, parallelizing the model fit, starts and stops workers repeatedly and should, in theory, slow things down. Your mileage may vary.
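If you do end up mixing both levels, a common precaution is to keep (number of workers) x (threads per model) at or below the number of cores so the machine is not oversubscribed. A minimal sketch in base R; the helper name threads_per_worker is hypothetical and not part of caret or xgboost:

```r
# Hypothetical helper: how many xgboost threads each caret worker can use
# without exceeding the total number of cores.
threads_per_worker <- function(total_cores, workers) {
  stopifnot(total_cores >= 1, workers >= 1)
  # Integer division, but never fewer than one thread per worker
  max(1L, as.integer(total_cores %/% workers))
}

# Example: 8 cores shared by 4 resampling workers leaves 2 threads per model
nthread_value <- threads_per_worker(total_cores = 8, workers = 4)
```

The result could then be passed as nthread to train() alongside a registered parallel back-end; whether that beats either pure strategy would still need to be timed.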

I don't have OpenMP installed but there is code below to test (if you could report your results, that would be helpful).

library(caret)
library(plyr)
library(xgboost)
library(doMC)

foo <- function(...) {
  set.seed(2)
  mod <- train(Class ~ ., data = dat, 
               method = "xgbTree", tuneLength = 50,
               ..., trControl = trainControl(search = "random"))
  invisible(mod)
}

set.seed(1)
dat <- twoClassSim(1000)

just_seq <- system.time(foo())


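## I don't have OpenMP installed
xgb_par <- system.time(foo(nthread = 5))

## Let caret parallelize the resampling loop over 5 workers.
## nthread = 1 keeps xgboost single-threaded inside each worker
## (this matters when OpenMP is available).
registerDoMC(cores=5)
mc_par <- system.time(foo(nthread = 1))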

My results (without OpenMP):

> just_seq[3]
elapsed 
326.422 
> xgb_par[3]
elapsed 
319.862 
> mc_par[3]
elapsed 
102.329 
> 
> ## Speedups
> xgb_par[3]/mc_par[3]
elapsed 
3.12582 
> just_seq[3]/mc_par[3]
 elapsed 
3.189927 
> just_seq[3]/xgb_par[3]
 elapsed 
1.020509
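
The ratios above can be reproduced from the elapsed times alone. The sketch below also derives a rough parallel efficiency (speedup divided by the number of workers) for the doMC run; that efficiency figure is an extra illustration and was not part of the original timings.

```r
# Elapsed times in seconds, copied from the results above
elapsed <- c(just_seq = 326.422, xgb_par = 319.862, mc_par = 102.329)
cores <- 5  # workers registered with registerDoMC(cores = 5)

# Speedup of each run relative to the sequential baseline
speedup <- elapsed[["just_seq"]] / elapsed

# Fraction of the ideal 5x speedup achieved by parallelizing the resampling loop
efficiency_mc <- speedup[["mc_par"]] / cores

print(round(speedup, 3))
print(round(efficiency_mc, 3))
```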