I want to parallelize the model fitting process for xgboost while using caret. From what I have seen in xgboost's documentation, the `nthread` parameter controls the number of threads to use while fitting the models, in the sense that trees are built in parallel. Caret's `train` function performs parallelization in a different sense: for example, by running a separate process for each resample in a k-fold CV. Is this understanding correct, and if so, is it better to:
- Register the number of cores (for example, with the `doMC` package and the `registerDoMC` function), set `nthread = 1` via caret's `train` function so it passes that parameter to xgboost, set `allowParallel = TRUE` in `trainControl`, and let caret handle the parallelization for the cross-validation; or
- Disable caret's parallelization (`allowParallel = FALSE` and no parallel back-end registration) and set `nthread` to the number of physical cores, so the parallelization is contained exclusively within xgboost.
Or is there no "better" way to perform the parallelization?
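For concreteness, here is a minimal sketch of both setups. The data frame `df`, the outcome `y`, and the core counts are hypothetical placeholders, not part of the original question:

```r
library(caret)
library(doMC)

## Option 1: caret parallelizes across CV folds, xgboost stays single-threaded
registerDoMC(cores = 4)  # assumed 4 physical cores

ctrl_par <- trainControl(method = "cv", number = 5, allowParallel = TRUE)

fit_caret <- train(
  y ~ ., data = df,
  method    = "xgbTree",
  trControl = ctrl_par,
  nthread   = 1          # passed through to xgboost: one thread per fold
)

## Option 2: no parallel back-end, all parallelism inside xgboost
ctrl_seq <- trainControl(method = "cv", number = 5, allowParallel = FALSE)

fit_xgb <- train(
  y ~ ., data = df,
  method    = "xgbTree",
  trControl = ctrl_seq,
  nthread   = parallel::detectCores(logical = FALSE)  # physical cores only
)
```

Each option deliberately pins the other knob (back-end off, or `nthread = 1`), since combining both can oversubscribe the CPU.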
Edit: I ran the code suggested by @topepo, with `tuneLength = 10` and `search = "random"`, and specifying `nthread = 1` on the last line (otherwise I understand that xgboost will use multithreading). These are the results I got:
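(For reference, a sketch of how such timings can be taken; the actual `train` calls come from @topepo's answer, which is not reproduced here. Element `[3]` of a `system.time()` result is the elapsed wall-clock time in seconds.)

```r
## Sequential baseline: no caret back-end, single xgboost thread
just_seq <- system.time(
  train(y ~ ., data = df, method = "xgbTree",
        trControl = trainControl(method = "cv", search = "random",
                                 allowParallel = FALSE),
        tuneLength = 10, nthread = 1)
)
## xgb_par: same call but with nthread = parallel::detectCores(logical = FALSE)
## mc_par:  registerDoMC() first, allowParallel = TRUE, nthread = 1
just_seq[3]  # elapsed seconds
```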
```
> xgb_par[3]
 elapsed 
 283.691 
> just_seq[3]
 elapsed 
 276.704 
> mc_par[3]
 elapsed 
  89.074 
> just_seq[3]/mc_par[3]
 elapsed 
3.106451 
> just_seq[3]/xgb_par[3]
  elapsed 
0.9753711 
> xgb_par[3]/mc_par[3]
 elapsed 
3.184891 
```
In the end, it turned out that, both for my data and for this test case, letting caret handle the parallelization was the better choice in terms of runtime.
(In a comment, Alex W suggested using `library(microbenchmark)` for the timing comparison.)
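That suggestion would look roughly like the sketch below, reusing the hypothetical `ctrl_par` and `ctrl_seq` objects from above; `times = 2` is an arbitrary small repetition count, since each expression is slow:

```r
library(microbenchmark)

## Repeat each configuration a few times instead of trusting a single
## system.time() run; microbenchmark reports the distribution of timings.
microbenchmark(
  mc_par  = train(y ~ ., data = df, method = "xgbTree",
                  trControl = ctrl_par, tuneLength = 10, nthread = 1),
  xgb_par = train(y ~ ., data = df, method = "xgbTree",
                  trControl = ctrl_seq, tuneLength = 10,
                  nthread = parallel::detectCores(logical = FALSE)),
  times = 2
)
```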