I successfully initialise a cluster and train a DRF model. Then on the same cluster I try do a grid search for an XGBoost model.
H2OGridSearch(
H2OXGBoostEstimator(my_model_params),
hyper_params=my_grid_params,
search_criteria=my_search_criteria
)
Sometimes (not always) the grid search never finishes. Upon inspection in the H2O flow I found the job stuck at 0% progress with a 'RUNNING' status.
What I saw in the logs is the following
WARN: XGBoost task of type 'Boosting Iteration (tid=0)' is taking unexpectedly long, it didn't finish in 360 seconds.
WARN: XGBoost task of type 'Boosting Iteration (tid=0)' is taking unexpectedly long, it didn't finish in 420 seconds.
...
WARN: XGBoost task of type 'Boosting Iteration (tid=0)' is taking unexpectedly long, it didn't finish in 60240 seconds.
and after that I get
ERRR: water.api.HDFSIOException: HDFS IO Failure:
but the job's status is still 'RUNNING'.
I'm using h2o 3.30.0.6 via Python 3.7. The problem is that the error is not reproducible and sometimes it just works fine.
Any hints on how to track down the root cause?
Is there a parameter I can set for killing the whole job when a boosting iteration takes too long?