I cannot figure out why my random forest grid search hangs. I have tried many of the things suggested on Stack Overflow, but nothing works. First of all, here is my code:
library(data.table)
library(h2o)
library(dplyr)
# Initialise H2O
localH2O = h2o.init(nthreads = -1, min_mem_size = "9240M", max_mem_size = "11336M")
h2o.removeAll()
# Specify some dirs, inputs etc. (not shown)
laufnummer <- 10
set.seed(laufnummer)
maxmodels <- 500
# Convert to h2o
h2o_input <- as.h2o(input)
# Split: 80% = train; 0 = valid; rest = 20% = test
splits <- h2o.splitFrame(h2o_input, c(0.80, 0))
train <- h2o.assign(splits[[1]], "train") # 80%
test <- h2o.assign(splits[[3]], "test")   # 20%
Then I set the parameters:
# Select range of ntrees
min_ntrees <- 10
max_ntrees <- 2500
stepsize_ntrees <- 20
ntrees_opts <- seq(min_ntrees,max_ntrees, stepsize_ntrees)
# Select range of tries
min_mtries <- 1
max_mtries <- 12
stepsize_mtries <- 1
mtries_opts <- seq(min_mtries,max_mtries, stepsize_mtries)
# Cross-validation number of folds
nfolds <- 5
hyper_params_dl <- list(ntrees = ntrees_opts,
                        mtries = mtries_opts)
search_criteria_dl <- list(strategy = "RandomDiscrete",
                           max_models = maxmodels)
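For reference, according to the H2O docs the "RandomDiscrete" strategy also accepts a time budget and early-stopping fields in search_criteria, which would at least make the search stop cleanly instead of running indefinitely. A sketch (untried; the numeric values are placeholders, not recommendations):

```r
# Sketch: same search criteria, plus a time budget and early stopping.
# max_runtime_secs and the stopping_* fields are documented options for
# the "RandomDiscrete" strategy; the values here are placeholders.
search_criteria_dl <- list(strategy = "RandomDiscrete",
                           max_models = maxmodels,
                           max_runtime_secs = 3600,  # hard stop after 1 h
                           stopping_rounds = 5,      # stop if the last 5
                           stopping_metric = "AUTO", # models do not improve
                           stopping_tolerance = 1e-3)
```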
Finally, the random grid search (this is where it hangs, almost always at around 25%):
rf_grid <- h2o.grid(seed = laufnummer,
                    algorithm = "randomForest",
                    grid_id = "dlgrid",
                    x = predictors,
                    y = response,
                    training_frame = train,
                    nfolds = nfolds,
                    keep_cross_validation_predictions = TRUE,
                    model_id = "rf_grid",
                    hyper_params = hyper_params_dl,
                    search_criteria = search_criteria_dl)
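As a possible workaround for the hang, H2O's documentation says that calling h2o.grid() again with an existing grid_id resumes that grid and appends new models, so the search could be run in small batches with a checkpoint between them. A sketch (the batch size is an arbitrary placeholder):

```r
# Sketch: run the random search in batches by reusing the same grid_id.
# Each call adds up to batchsize models to the existing "dlgrid" grid,
# so partial results survive even if one batch hangs.
batchsize <- 50  # placeholder: models per batch
for (i in seq_len(maxmodels / batchsize)) {
  h2o.grid(algorithm = "randomForest",
           grid_id = "dlgrid",              # same id -> grid is extended
           x = predictors,
           y = response,
           training_frame = train,
           nfolds = nfolds,
           keep_cross_validation_predictions = TRUE,
           hyper_params = hyper_params_dl,
           search_criteria = list(strategy = "RandomDiscrete",
                                  max_models = batchsize))
  rf_grid <- h2o.getGrid("dlgrid")          # checkpoint between batches
  print(length(rf_grid@model_ids))          # how many models exist so far
}
```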
Here is what I already tried:
- Did not set nthreads in init: no effect.
- Set nthreads to 4: no effect.
- Set lower memory (I have 16 GB): no effect.
- Added parallelism = 0 in the grid search: no effect.
- Did not use h2o.removeAll(): no effect.
- Always used h2o.shutdown(prompt = FALSE) at the end: no effect.
- Used different versions of JDK, R, and h2o (now using the latest of each).
The problem is that the grid search progress stops at around 25%, sometimes less.
What does help is switching the code to GBM instead of RF, but it sometimes hangs there as well (and I need RF!). Reducing the number of models from 5000 to 500 also helped, but only for NN and GBM, not RF.
After trying for some weeks now, I would appreciate any help very much!! Thank you!
UPDATE: Thanks for your suggestions; here is what I tried:
1. Imported the already split files with h2o.importFile(): no effect. No surprise, because the data set is so small that loading only takes a few seconds.
2. Set nthreads to 1: no effect.
3. Do not use xgboost: I am not aware that I use it.
4. Do not use RF: not possible, because I am comparing machine learning algorithms.
5. h2o.init(jvm_custom_args = c("-XX:+PrintGCDetails", "-XX:+PrintGCTimeStamps")): did not work, because H2O would not start with this parameter added.
6. Bought an additional 8 GB of RAM and set max_mem_size to 18 GB and 22 GB respectively: the grid now stops at about 65% and 80% instead of 25%. Interestingly, the progress bar gets slower and slower until it stops completely. Then something like a hard reset takes place, since my keyboard layout (Win10) is reset to the default... Note: 500 GBM or NN models run fine with the same data set.
7. Reduced the number of models to 300: no effect.
So, my conclusion is that it is definitely a memory issue, but I cannot really monitor it: the RAM shown in the Task Manager is not at 100%, but at the allocated max_mem_size. Any hints that help me pinpoint the problem further are greatly appreciated - thanks guys!!
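One way to watch the JVM's memory from inside R (rather than the Task Manager, which only shows the allocated heap) might be to attach a second R session to the same running cluster and poll h2o.clusterStatus(), which reports per-node free memory; h2o.ls() would also show how many frames accumulate in the key/value store (keeping cross-validation predictions for hundreds of models keeps many such frames alive). A sketch, assuming a default local cluster:

```r
# Sketch: from a second R session, poll the running H2O cluster's memory.
# h2o.init() with no arguments attaches to an already running local cluster.
library(h2o)
h2o.init()
for (i in 1:100) {
  status <- h2o.clusterStatus()  # one row per node, incl. free memory
  print(status$free_mem)         # free JVM memory per node
  print(nrow(h2o.ls()))          # number of keys held in the K/V store
  Sys.sleep(30)                  # sample every 30 seconds
}
```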