h2o task taking unexpectedly long leading to it getting stuck

Question

I successfully initialise a cluster and train a DRF model. Then, on the same cluster, I try to do a grid search for an XGBoost model.

H2OGridSearch(
    H2OXGBoostEstimator(my_model_params),
    hyper_params=my_grid_params,
    search_criteria=my_search_criteria
)

Sometimes (not always) the grid search never finishes. On inspecting it in H2O Flow, I found the job stuck at 0% progress with a 'RUNNING' status.
This is what I saw in the logs:

WARN: XGBoost task of type 'Boosting Iteration (tid=0)' is taking unexpectedly long, it didn't finish in 360 seconds.  
WARN: XGBoost task of type 'Boosting Iteration (tid=0)' is taking unexpectedly long, it didn't finish in 420 seconds.
...
WARN: XGBoost task of type 'Boosting Iteration (tid=0)' is taking unexpectedly long, it didn't finish in 60240 seconds.

and after that I get

ERRR: water.api.HDFSIOException: HDFS IO Failure:

but the job's status is still 'RUNNING'.

I'm using h2o 3.30.0.6 via Python 3.7. The problem is that the error is not reproducible and sometimes it just works fine.

Any hints on how to track down the root cause?
Is there a parameter I can set for killing the whole job when a boosting iteration takes too long?

Neema Mashayekhi · Accepted Answer · 2020-09-08T18:05:05

If XGBoost becomes unresponsive, you may need to allocate additional memory for it, since it uses memory independently of the H2O algorithms.

Why does my H2O cluster on Hadoop become unresponsive when running XGBoost even when I supplied 4 times the datasize memory?

This is why the extramempercent option exists, and we recommend setting this to a high value, such as 120. What happens internally is that when you specify -node_memory 10G and -extramempercent 120, the h2o driver will ask Hadoop for 10G * (1 + 1.2) = 22G of memory. At the same time, the h2o driver will limit the memory used by the container JVM (the h2o node) to 10G, leaving 10G * 120% = 12G of memory "unused." This memory can then be safely used by XGBoost outside of the JVM. Keep in mind that H2O algorithms will only have access to the JVM memory (10GB), while XGBoost will use the native memory for model training. For example:

hadoop jar h2odriver.jar -nodes 1 -mapperXmx 20g -extramempercent 120
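To sanity-check the sizing before launching a cluster, the arithmetic in the quoted passage can be sketched in a few lines of Python. The helper name hadoop_memory_request is hypothetical and is not part of the h2o package; it only reproduces the calculation described above, assuming the extra memory is a simple percentage of the JVM heap.

```python
def hadoop_memory_request(jvm_heap_gb, extra_mem_percent):
    """Return (total_gb, native_gb) for one H2O node.

    Hypothetical helper, not part of h2o: it only mirrors the
    arithmetic described in the quoted documentation.
    """
    # Memory left outside the JVM, which XGBoost can use natively.
    native_gb = jvm_heap_gb * extra_mem_percent / 100
    # What the driver would ask Hadoop for: JVM heap plus the extra share.
    total_gb = jvm_heap_gb + native_gb
    return total_gb, native_gb


total, native = hadoop_memory_request(10, 120)
print(f"container: {total:g}G, JVM heap: 10G, left for XGBoost: {native:g}G")
```

With a 10G heap and extramempercent 120 this gives a 22G container, of which 12G stays outside the JVM. For the 20g heap in the example command, the same calculation gives 44G in total and 24G for XGBoost.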

Source