3 votes

I am trying to use caret to cross-validate an elastic net model (via the glmnet implementation) on an Ubuntu machine with 8 CPU cores & 32 GB of RAM. When I train sequentially, I max out CPU usage on one core but use about 50% of the memory on average.

  • When I use registerDoMC(cores = xxx), do I need to worry about only registering xxx = floor(100/y) cores, where y is the memory usage of the model when using a single core (in %), in order to not run out of memory?

  • Does caret have any heuristics that allow it to figure out the max. number of cores to use?

  • Is there any set of heuristics that I can use to dynamically adjust the number of cores to use my computing resources optimally across different sizes of data and model complexities?


Edit:

FWIW, attempting to use 8 cores made my machine unresponsive. Clearly caret does not check to see if spawning xxx processes is likely to be problematic. How can I then choose the number of cores dynamically?

library(parallel); detectCores() is a way to determine how many cores are available. The handling of the return value is OS-dependent, but it would be interesting to know how many cores R thinks are available in your setup; it may return a number smaller than 8. – Mike
@Mike That is not the point. I know that there are 8 cores; even assuming there are fewer, I will run out of memory well before 4, 6, or 8 cores are used. What I need is a heuristic, efficient way to figure out how many cores to register for the problem at hand. – tchakravarty

1 Answer

3 votes

Clearly caret does not check to see if spawning xxx processes is likely to be problematic.

True; it cannot predict future performance of your computer.

You should get an understanding of how much memory you use for modeling when running sequentially. You can start the training, use top or other methods to estimate the amount of RAM used, then kill the process. If you use X GB of RAM sequentially, running on M cores will require roughly X(M+1) GB of RAM (one copy per worker plus the original session).
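
A minimal sketch of that heuristic in R, assuming you have already measured X with top while a sequential run was training (the 16 GB figure below is just the "~50% of 32 GB" described in the question, and the variable names are placeholders, not part of caret or doMC):

    library(parallel)
    library(doMC)

    ## Assumed measurements -- replace with what you observe via top
    ## while a single sequential caret/glmnet run is training:
    x_gb     <- 16   # RAM (GB) used by one sequential run (from the question: ~50% of 32 GB)
    total_gb <- 32   # total RAM on the machine

    ## Rule of thumb from the answer: M workers need roughly X * (M + 1) GB,
    ## so solve X * (M + 1) <= total for M and cap at the physical core count.
    max_by_mem   <- floor(total_gb / x_gb) - 1
    max_by_cores <- detectCores()
    n_workers    <- max(1, min(max_by_mem, max_by_cores))

    registerDoMC(cores = n_workers)
    ## ... then call caret::train(..., method = "glmnet") as before;
    ## train() picks up the registered foreach backend automatically.

With the ~16 GB sequential footprint described in the question this yields a single worker, which is consistent with the 8-core attempt exhausting memory; a smaller model would allow more workers.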