
I have a fairly small dataset: 15 columns and 3500 rows, and I am consistently seeing that XGBoost in H2O trains a better model than H2O AutoML. I am using H2O 3.26.0.2 and the Flow UI.

H2O XGBoost finishes in a matter of seconds, while AutoML runs for as long as it is given (20 minutes) and always gives me worse performance.

I admit the dataset might not be perfect, but I would expect AutoML with grid search to be as good as (or better than) plain H2O XGBoost. My thinking is that AutoML will train multiple XGBoost models and grid-search their hyperparameters, so the results should be similar, right?

For both AutoML and XGBoost I use the same training dataset and the same response column.

The code for running the experiment with XGBoost is:

import csv

import h2o
from h2o.estimators.xgboost import H2OXGBoostEstimator

h2o.init()

h2o_frame = h2o.import_file(path="myFile.csv")

feature_columns = h2o_frame.columns
label_column = "responseColumn"
feature_columns.remove(label_column)

xgb = H2OXGBoostEstimator(nfolds=10, seed=1)

xgb.train(x=feature_columns, y=label_column, training_frame=h2o_frame)

# now export metrics to file
MRD = xgb.mean_residual_deviance()
RMSE = xgb.rmse()
MSE = xgb.mse()
MAE = xgb.mae()
RMSLE = xgb.rmsle()

header = ['model','mean_residual_deviance','rmse','mse','mae','rmsle']

with open('metrics.out', mode='w') as result_file:
    writer = csv.writer(result_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    writer.writerow(header)
    writer.writerow(['H2O_XGBoost', MRD, RMSE, MSE, MAE, RMSLE])

The code for running the experiment with AutoML is:

import h2o
from h2o.automl import H2OAutoML

h2o.init()

h2o_frame = h2o.import_file(path="myFile.csv")

feature_columns = h2o_frame.columns
label_column = "responseColumn"
feature_columns.remove(label_column)

aml = H2OAutoML(seed=1, nfolds=10, exclude_algos=["StackedEnsemble"], max_models=20)

aml.train(x=feature_columns, y=label_column, training_frame=h2o_frame)

# now export metrics to file
h2o.export_file(aml.leaderboard, "metrics.out", force=True, parts=1)

I tried different nfolds values, more models for AutoML, and increasing the early stopping rounds. I also tried excluding all algorithms from AutoML except XGBoost (sketched below) and I still get the same results.
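
For reference, here is a minimal sketch of that XGBoost-only run; the algorithm names passed to exclude_algos follow the AutoML docs, and include_algos=["XGBoost"] should be an equivalent shorthand in recent releases:

# Sketch: restrict AutoML to XGBoost by excluding every other algorithm.
aml_xgb_only = H2OAutoML(
    seed=1,
    nfolds=10,
    max_models=20,
    exclude_algos=["GLM", "DRF", "GBM", "DeepLearning", "StackedEnsemble"],
)
aml_xgb_only.train(x=feature_columns, y=label_column, training_frame=h2o_frame)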

Here are the differences in results:

H2O XGBoost:

model   xgboost-5a8f9766-940c-4e5c-b57d-62b186f4c058
model_checksum  7409831159060775248
frame   train_set_v01.hex
frame_checksum  6864971999838167226
description ·
model_category  Regression
scoring_time    1566296468447
predictions ·
MSE 252.265021
RMSE    15.882853
nobs    3476
custom_metric_name  ·
custom_metric_value 0
r2  0.726871
mean_residual_deviance  252.265021
mae 10.709369
rmsle   NaN

XGBoost native params for xgboost-5a8f9766-940c-4e5c-b57d-62b186f4c058:

name    value
silent  true
eta 0.3
colsample_bylevel   1
objective   reg:linear
min_child_weight    1
nthread 8
seed    -1058380797
max_depth   6
colsample_bytree    1
lambda  1
gamma   0
alpha   0
booster gbtree
grow_policy depthwise
nround  50
subsample   1
max_delta_step  0
tree_method auto

H2O AutoML (winning model):

model   StackedEnsemble_AllModels_AutoML_20190819_235446
model_checksum  -6727284429527535576
frame   automl_training_train_set_v01.hex
frame_checksum  6864971999838167226
description ·
model_category  Regression
scoring_time    1566256209073
predictions ·
MSE 332.146239
RMSE    18.224880
nobs    3476
custom_metric_name  ·
custom_metric_value 0
r2  0.640383
mean_residual_deviance  332.146239
mae 12.927023
rmsle   1.225650
residual_deviance   1154540.326762
null_deviance   3210476.302359
AIC 30070.640602
null_degrees_of_freedom 3475
residual_degrees_of_freedom 3464

And the best-rated XGBoost model from the same AutoML run (third on the leaderboard):

model   XGBoost_grid_1_AutoML_20190819_235446_model_5
model_checksum  8047828446507408480
frame   automl_training_train_set_v01.hex
frame_checksum  6864971999838167226
description ·
model_category  Regression
scoring_time    1566255442068
predictions ·
MSE 616.910151
RMSE    24.837676
nobs    3476
custom_metric_name  ·
custom_metric_value 0
r2  0.332068
mean_residual_deviance  616.910151
mae 17.442629
rmsle   1.325149

XGBoost native params (for XGBoost_grid_1_AutoML_20190819_235446_model_5 in AutoML):

name    value
silent  true
normalize_type  tree
eta 0.05
objective   reg:linear
colsample_bylevel   0.8
nthread 8
seed    940795529
min_child_weight    15
rate_drop   0
one_drop    0
sample_type uniform
max_depth   20
colsample_bytree    1
lambda  100
gamma   0
alpha   0.1
booster dart
grow_policy depthwise
skip_drop   0
nround  120
subsample   0.8
max_delta_step  0
tree_method auto
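
Side note: both native-params tables above were copied out of Flow. A rough Python equivalent, if I read the API correctly, is to print the resolved parameters off the trained estimator; note these are the H2O-level parameters, not the native XGBoost ones Flow shows:

# Sketch: print the resolved (post-default) H2O parameters from Python.
# actual_params should be a dict property on the trained model; these are
# the H2O-level parameters, not the native XGBoost params shown in Flow.
for name, value in sorted(xgb.actual_params.items()):
    print(name, value)
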
H2O AutoML contains XGBoost, so you should see equal or better performance from AutoML. Are you scoring these on a separate test set? How are you evaluating them? What parameters are you using for XGBoost? Please update the post with code. – Erin LeDell
Thanks Erin - updated with the code and with the XGBoost params as extracted from Flow. I am pretty sure we are doing something silly here. – anthony
I would expect the same. From your experiments, it seems that the default hyperparameters H2O uses when you call XGBoost directly are different from what AutoML uses and lie outside the AutoML search space. Apparently the AutoML hyperparameter space H2O uses for XGBoost is not well suited to your particular dataset. If this is true, it could be an issue with AutoML. You could open an issue on the H2O GitHub repo. – ilmiacs
I was able to reproduce exactly the same behaviour with Churn_Modelling.csv found here: sds-platform-private.s3-us-east-2.amazonaws.com/uploads/… - if you use the Exited column as the response column, you can reproduce the scenario where XGBoost gives a better result in 8 seconds than AutoML does after one hour. – anthony
@anthony I posted a full answer below. – Erin LeDell

1 Answer


The problem here is that you are comparing training metrics for XGBoost with cross-validation (CV) metrics for the AutoML models.

The code you posted for the manual XGBoost model reports training metrics. Instead, you need to grab the CV metrics to make a fair comparison to the performance of the models in AutoML (CV metrics are reported by default in the AutoML leaderboard, and that's what your code exports).

Change this:

# now export metrics to file
MRD = xgb.mean_residual_deviance()
RMSE = xgb.rmse()
MSE = xgb.mse()
MAE = xgb.mae()
RMSLE = xgb.rmsle()

To:

# now export metrics to file
MRD = xgb.mean_residual_deviance(xval=True)
RMSE = xgb.rmse(xval=True)
MSE = xgb.mse(xval=True)
MAE = xgb.mae(xval=True)
RMSLE = xgb.rmsle(xval=True)

The metrics and what they return are described in the Python module docs.

Once you make this change, you should see the issue resolved, with comparable performance between the manual XGBoost model and the AutoML models.
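
If you prefer, the whole set of CV metrics can also be pulled in one call via model_performance(xval=True); a short sketch, reusing the xgb and aml objects from the question:

# Sketch: grab the full cross-validation metrics object in one call;
# the individual metrics then come off that object.
cv_perf = xgb.model_performance(xval=True)
print(cv_perf.rmse(), cv_perf.mae())

# The AutoML leaderboard already reports CV metrics, so the two numbers
# are now directly comparable.
print(aml.leaderboard.head())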