
I am building a GBM model using h2o. The training data is randomly split into 70% development data (DEV) and 30% in-time validation data (VAL). The training data has a 1.4% bad rate, and I also need to assign a weight to each observation (the data has a weight column). My observations: the model built with weights performs much better on the development data than the model built without weights, and the model built with weights shows a big performance gap between the development and in-time validation data. For instance, the model built with weights shows the following top 10% capture rates:

DEV: 56%
Validation: 25%

The model built without weights shows the following top 10% capture rates:

DEV: 35%
Validation: 23%

It seems that using weights in this case helped the model's performance on both the development and in-time validation data. I'm wondering how exactly the weights are used in h2o. With weights used in model building, does the bigger performance difference between DEV and VAL indicate higher instability of the GBM model built in h2o?
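
For reference, this is roughly how the two models are built (a simplified sketch; the frame name, response column, and weight column name are placeholders for my actual data):

library(h2o)
h2o.init()

# Random 70/30 split into development (DEV) and in-time validation (VAL) data
parts <- h2o.splitFrame(data, ratios = 0.7, seed = 42)
dev <- parts[[1]]
val <- parts[[2]]

y <- "bad_flag"                            # placeholder response column
x <- setdiff(names(data), c(y, "weight"))  # all other columns as predictors

# GBM with the weight column
gbm_w <- h2o.gbm(x = x, y = y,
                 training_frame = dev,
                 validation_frame = val,
                 weights_column = "weight")

# GBM without the weight column
gbm_nw <- h2o.gbm(x = x, y = y,
                  training_frame = dev,
                  validation_frame = val)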

[chart: log loss with and without weight]

The blue curve is DEV and the orange curve is VAL. In the no-weight case, the log loss for DEV and VAL starts from the same point, while in the weighted case the log loss for DEV and VAL starts from two different points. How should this log loss chart be interpreted, and why do weights in the h2o GBM create such a difference in the log loss output?

1 Answer


Without more information (such as the actual data), my guess would be that the starting-point gap you see is random noise.

To investigate this further I'd suggest first trying different random seeds. I like to do this using h2o.grid, making seed a hyper-parameter. Just 3 or 4 different values will give you a good feel for how much randomness affects the model.

The second thing I'd try is different train/valid splits. Again, explicitly give a seed to the split function, so that you can get repeatable results. If your data set is fairly small I'd expect this to be the bigger factor.

Putting these two ideas together (rough R code; data, x, y, and the weight column name are placeholders for your own setup):

for (split_seed in c(1, 1103, 4387)) {
  # Repeatable 70/30 split for this seed
  parts <- h2o.splitFrame(data, ratios = 0.7, seed = split_seed)
  dev <- parts[[1]]
  val <- parts[[2]]

  # Grid with weights: seed is a hyper-parameter, so you get 3 models per split
  h2o.grid(
    algorithm = "gbm",
    grid_id = paste0("ww", split_seed),
    x = x, y = y,
    training_frame = dev,
    validation_frame = val,
    weights_column = "weight",  # assumes your weight column is called "weight"
    hyper_params = list(seed = c(77, 800, 2099))
  )

  # Same grid without the weight column
  h2o.grid(
    algorithm = "gbm",
    grid_id = paste0("wo", split_seed),
    x = x, y = y,
    training_frame = dev,
    validation_frame = val,
    hyper_params = list(seed = c(77, 800, 2099))
  )
}

My guess is that if you overlay the 9 score history charts you get with weights and compare them to the 9 score history charts you get without weights, you'll see a similar amount of blurriness in each.
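
For example, something along these lines (a sketch, assuming the grid ids from the loop above; h2o.scoreHistory() gives the per-tree training and validation metrics) would let you overlay the curves for one grid:

# Overlay training (blue) and validation (orange) log loss for every model in one grid
plot_grid_history <- function(grid_id) {
  grid <- h2o.getGrid(grid_id)
  plot(NULL, xlim = c(0, 500), ylim = c(0, 0.1),  # adjust the limits to your data
       xlab = "number of trees", ylab = "log loss", main = grid_id)
  for (model_id in unlist(grid@model_ids)) {
    sh <- h2o.scoreHistory(h2o.getModel(model_id))
    lines(sh$number_of_trees, sh$training_logloss, col = "blue")
    lines(sh$number_of_trees, sh$validation_logloss, col = "orange")
  }
}

plot_grid_history("ww1")  # with weights, split seed 1
plot_grid_history("wo1")  # without weights, split seed 1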

If, depending on whether weights are used, you always (or never) get that starting gap on all 9, then something more interesting is going on, and I hope you can make enough data and code available so that others can reproduce it.