1
votes

I cannot figure out why my random forest grid search hangs. I tried many things suggested on Stack Overflow, but nothing works. First of all, here is my code:

library(data.table)
library(h2o)
library(dplyr)

# Initialise H2O
localH2O = h2o.init(nthreads = -1, min_mem_size = "9240M", max_mem_size = "11336M")

h2o.removeAll()

# Specify some dirs, inputs etc. (not shown)
laufnummer  <- 10
set.seed(laufnummer)
maxmodels   <- 500

# Convert to h2o
h2o_input <- as.h2o(input)

# Split: 80% = train; 0 = valid; rest = 20% = test
splits <- h2o.splitFrame(h2o_input, c(0.80,0))
train  <- h2o.assign(splits[[1]], "train") # 80%
test   <- h2o.assign(splits[[3]], "test")  # 20%

Set parameters:

# Select range of ntrees
min_ntrees      <- 10
max_ntrees      <- 2500
stepsize_ntrees <- 20
ntrees_opts <- seq(min_ntrees,max_ntrees, stepsize_ntrees)

# Select range of tries
min_mtries      <- 1
max_mtries      <- 12
stepsize_mtries <- 1
mtries_opts <- seq(min_mtries,max_mtries, stepsize_mtries)

# Cross-validation number of folds
nfolds <- 5

hyper_params_dl = list(ntrees = ntrees_opts,
                       mtries = mtries_opts)
search_criteria_dl = list(
  strategy = "RandomDiscrete",
  max_models = maxmodels)

Finally, the random grid search (this is where it hangs, almost always at around 25%):

rf_grid <- h2o.grid(seed = laufnummer,
                    algorithm = "randomForest", 
                    grid_id = "dlgrid",
                    x = predictors, 
                    y = response, 
                    training_frame = train,
                    nfolds = nfolds,
                    keep_cross_validation_predictions = TRUE,
                    model_id = "rf_grid",
                    hyper_params = hyper_params_dl,
                    search_criteria = search_criteria_dl
)

Here is what I already tried:

  1. Did not set nthreads in init: no effect.
  2. Set nthreads to 4: no effect.
  3. Set lower memory (I have 16 GB): no effect.
  4. Added parallelism = 0 in grid search: no effect.
  5. Did not use h2o.removeAll(): no effect.
  6. Always used h2o.shutdown(prompt = FALSE) at the end: no effect.
  7. Used different versions of JDK, R, and H2O (now using the latest of each).

The problem is that the grid search progress stops at around 25%, sometimes less.

What does help is switching the code to GBM instead of RF, but it sometimes hangs there as well (and I need RF!). What also helped was reducing the number of models from 5000 to 500, but only for NN and GBM, not RF.

After trying for some weeks now, I would appreciate any help very much!! Thank you!

UPDATE: Thanks for your suggestions, here is what I tried:

  1. Imported already split files with h2o.importFile(): no effect. No surprise, because it is such a small data set and loading takes a few seconds.
  2. Set nthreads to 1: no effect.
  3. Do not use xgboost: I am not aware that I use it.
  4. Do not use RF: not possible, because I am trying to compare machine learning algorithms.
  5. h2o.init(jvm_custom_args = c("-XX:+PrintGCDetails", "-XX:+PrintGCTimeStamps")): did not work, because H2O would not start with this parameter added.
  6. Bought an additional 8 GB of RAM and set max_mem_size to 18 and 22 GB respectively: effect = it now stops at about 65% and 80% instead of 25%. What is interesting is that the progress bar gets slower and slower until it stops completely. Then something like a hard reset takes place, because my keyboard layout (Win10) gets reset to the default... Note: 500 GBM or NN models run fine with the same data set.
  7. Reduced the number of models to 300: no effect.

So my conclusion is that it is definitely a memory issue, but I cannot really monitor it. The RAM usage in the Task Manager is not at 100%, but sits at the allocated max_mem_size. Anything that helps me pinpoint the problem further is greatly appreciated - thanks guys!!

Looks like you are running out of resources. Have you tried an AWS / Azure cluster? – phiver
Thanks. No, I did not try a cluster. I do not care if it takes 1-2 days on my machine, but it should not hang... And it does not help to constrain it to 1 CPU, so that is probably (?) not the reason. – litotes
You might want to contact H2O directly on their Gitter stream and link to this SO question. But they will need to know the size of your dataset. – phiver

1 Answer

1
votes

Most likely the cause of the hang is running out of memory. You either need to use less memory, or run your job on a system with more memory.

There are a number of factors at work here, and it's not necessarily obvious how to debug them unless you are aware of the underlying resource usage.

Below are three sections with suggestions about how to monitor memory usage, how to reduce memory usage, and how to get a system with more memory.


Here are some memory monitoring suggestions:

  1. Monitor your physical memory usage. Do this using a program like top on the Mac or on Linux. An important number to look at is RSS (resident set size), which represents the actual amount of physical memory being used on the host.

  2. Monitor any swapping. Make sure your system is not swapping to disk. Swapping occurs when you are trying to use more virtual memory (at one time) than you have physical memory on your host. On Linux, the vmstat command is good for showing swapping.

  3. Turn on Java GC logging with -XX:+PrintGCDetails -XX:+PrintGCTimeStamps. You will get more log output, which will show you whether Java itself is bogging down from running out of memory. This is very likely. Here is an example of how to do that by passing the jvm_custom_args flag when starting H2O-3 from inside of R:

h2o.init(jvm_custom_args = c("-XX:+PrintGCDetails", "-XX:+PrintGCTimeStamps"))

You will see a message showing:

H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    /var/folders/vv/pkzvhy8x5hsfbsjg75_6q4ch0000gn/T//RtmpUsdTRQ/h2o_tomk_started_from_r.out
    /var/folders/vv/pkzvhy8x5hsfbsjg75_6q4ch0000gn/T//RtmpUsdTRQ/h2o_tomk_started_from_r.err

The .out file above will now contain GC log output, as you can see below:

...
02-02 08:30:29.785 127.0.0.1:54321       21814  main      INFO: Open H2O Flow in your web browser: http://127.0.0.1:54321
02-02 08:30:29.785 127.0.0.1:54321       21814  main      INFO:
02-02 08:30:29.886 127.0.0.1:54321       21814  #84503-22 INFO: GET /, parms: {}
02-02 08:30:29.946 127.0.0.1:54321       21814  #84503-20 INFO: GET /, parms: {}
02-02 08:30:29.959 127.0.0.1:54321       21814  #84503-21 INFO: GET /, parms: {}
02-02 08:30:29.980 127.0.0.1:54321       21814  #84503-22 INFO: GET /3/Capabilities/API, parms: {}
02-02 08:30:29.981 127.0.0.1:54321       21814  #84503-22 INFO: Locking cloud to new members, because water.api.schemas3.CapabilitiesV3
02-02 08:30:30.005 127.0.0.1:54321       21814  #84503-25 INFO: GET /3/InitID, parms: {}
14.334: [GC (Allocation Failure) [PSYoungGen: 94891K->3020K(153088K)] 109101K->56300K(299008K), 0.0193290 secs] [Times: user=0.22 sys=0.01, real=0.02 secs]
14.371: [GC (Allocation Failure) [PSYoungGen: 120914K->3084K(153088K)] 174194K->173560K(338432K), 0.0256458 secs] [Times: user=0.29 sys=0.04, real=0.03 secs]
14.396: [Full GC (Ergonomics) [PSYoungGen: 3084K->0K(153088K)] [ParOldGen: 170475K->163650K(435200K)] 173560K->163650K(588288K), [Metaspace: 22282K->22282K(1069056K)], 0.0484233 secs] [Times: user=0.47 sys=0.00, real=0.05 secs]
14.452: [GC (Allocation Failure) [PSYoungGen: 118503K->160K(281088K)] 282153K->280997K(716288K), 0.0273738 secs] [Times: user=0.30 sys=0.05, real=0.02 secs]
14.479: [Full GC (Ergonomics) [PSYoungGen: 160K->0K(281088K)] [ParOldGen: 280837K->280838K(609792K)] 280997K->280838K(890880K), [Metaspace: 22282K->22282K(1069056K)], 0.0160751 secs] [Times: user=0.09 sys=0.00, real=0.02 secs]
14.516: [GC (Allocation Failure) [PSYoungGen: 235456K->160K(281088K)] 516294K->515373K(890880K), 0.0320757 secs] [Times: user=0.30 sys=0.10, real=0.03 secs]
14.548: [Full GC (Ergonomics) [PSYoungGen: 160K->0K(281088K)] [ParOldGen: 515213K->515213K(969216K)] 515373K->515213K(1250304K), [Metaspace: 22282K->22282K(1069056K)], 0.0171208 secs] [Times: user=0.09 sys=0.00, real=0.02 secs]

The "Allocation Failure" messages look scary, but are actually totally normal. The time to worry is when you see back-to-back Full GC cycles that take a large number of "real secs".


Here are some suggestions for using less memory:

  • Split the data once and save the pieces to disk, then read them back into a new, fresh H2O-3 cluster with separate as.h2o or h2o.importFile steps (see the sketch after this list).

    In your example, you are doing a splitFrame. This makes a duplicate copy of your data in memory.

  • Prefer h2o.importFile to as.h2o.

    I don't know how much of a difference this really makes in your case, but h2o.importFile was designed and tested for big data, and as.h2o was not.

  • Use less data.

    You have not said anything about the shape of your data, but if the AutoML or grid search works with GBM but not DRF, that is definitely pointing to running out of memory. The two algorithms do almost exactly the same thing computation-wise, but DRF models tend to be larger because DRF builds deeper trees by default, which means more memory is needed to store the models.

  • Use the nthreads option to reduce the number of concurrent worker threads.

    The more active concurrent threads you have running, the more memory you need, because each thread needs some working memory. You can try setting nthreads to half of the number of CPU cores you have, for example.

  • Don't use xgboost.

    XGBoost is special in the way it uses memory, because it makes a second copy of the data outside of the Java heap. This means that when you are using xgboost, you don't want to give the Java heap (max_mem_size, i.e. Xmx) your entire host's memory, or you can run into problems (especially swapping).

  • Don't use DRF.

    DRF trees are deeper, and hence the produced models are larger. Alternatively, build (and retain in memory) fewer DRF models, shallower DRF models, or models with fewer trees.
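
To make the first two bullets concrete, here is a rough sketch of splitting once, exporting the pieces to disk, and importing them into a fresh cluster later. It reuses input and laufnummer from the question's code; the CSV paths are placeholders:

# One-time preparation: split once and write the pieces to disk
library(h2o)
h2o.init()
h2o_input <- as.h2o(input)
splits    <- h2o.splitFrame(h2o_input, ratios = 0.80, seed = laufnummer)
h2o.exportFile(splits[[1]], path = "train.csv", force = TRUE)  # 80%
h2o.exportFile(splits[[2]], path = "test.csv",  force = TRUE)  # 20%
h2o.shutdown(prompt = FALSE)

# Later, in a fresh H2O cluster: import the files directly, so there is no
# as.h2o conversion and no duplicate copy from splitFrame
h2o.init()
train <- h2o.importFile("train.csv")
test  <- h2o.importFile("test.csv")

For the last bullet, one option (not shown here) is to add a smaller max_depth to the grid's hyper_params so that the DRF trees, and therefore the stored models, stay smaller.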


The best quick suggestion for getting more memory is to run in the cloud. You don't necessarily need a multi-node setup; a single large node is easier to work with if that will adequately solve the problem, and in your case it likely will. Given what you have said above (you have 16 GB now and the job finishes if you don't use DRF), I would start with an m5.4xlarge instance in EC2, which has 64 GB of RAM and costs under $1/hr, and give it a max_mem_size of 48G.
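
For example, on such a node the startup call might look like this (a sketch; adjust the heap size to whatever instance you end up with):

library(h2o)

# On a 64 GB node, give the H2O JVM a 48 GB heap and use all cores
h2o.init(nthreads = -1, max_mem_size = "48g")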