offset_column in h2o.gbm

Question

I'm using H2O 3.10.4.1

I'm trying to fit a Bernoulli model with GBM, using initial predictions from another model as the starting point, but I'm getting worse likelihoods than the starting predictions give. I was able to reproduce this with the Titanic data.

I was able to do what I want with R's gbm. R's gbm.fit expects the offset on the link scale, which is unrestricted: the values can be large and positive or large and negative.

However, when I try to do the same in H2O GBM, it throws an error:

water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for GBM model: GBM_model_R_1489164084643_3568. Details: ERRR on field: _offset_column: Offset cannot be larger than 1 for Bernoulli distribution.

My Jupyter notebook is here: Github

UPDATE: I was able to use the offset, but only for a data frame where ProbabilityLink is less than 1, since H2O complains otherwise. See cells 65-68 in the notebook.

I believe this is a bug in H2O. The requirement that the offset must be less than 1 for Bernoulli should simply be removed. The offset can be any value, and then it should work fine.

Lauren · Accepted Answer · 2017-03-17T16:17:08

updated

For older versions of H2O (3.10.2 or earlier), you have to use a value less than 1 for a Bernoulli distribution with H2O GBM's offset_column. For newer versions, however, you will be able to pass in any value. In your case, using a Bernoulli distribution, one way to create the offset column is to use the predicted logit values from a previous model (just as you said you wanted to do in the comments).
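As a rough illustration of turning a previous model's predictions into logit-scale offsets, here is a minimal base R sketch. The probabilities in `p_prev` and the clipping threshold `eps` are made up for illustration and do not come from the question's notebook.

```r
# Hypothetical predicted probabilities from an earlier model (illustrative values)
p_prev <- c(0.02, 0.35, 0.50, 0.90, 0.999)

# Clip away from 0 and 1 so the logit stays finite (eps is an arbitrary choice)
eps <- 1e-6
p_clipped <- pmin(pmax(p_prev, eps), 1 - eps)

# Logit (log-odds) transform: log(p / (1 - p))
offset_logit <- qlogis(p_clipped)

# Round trip back to probabilities with the inverse logit
p_back <- plogis(offset_logit)

print(offset_logit)
```

Note that the resulting offsets are not confined to values below 1: probabilities close to 0 or 1 map to large negative or large positive logits. Assuming an H2O version that accepts such values, a vector like this could then be attached to the training frame in the same way as the constant offset in the example further down.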

This is how the gbm offset column works: An offset is a per-row “bias value” that is used during model training. For Gaussian distributions, offsets can be seen as simple corrections to the response (y) column. Instead of learning to predict the response (y-row), the model learns to predict the (row) offset of the response column. For other distributions, the offset corrections are applied in the linearized space before applying the inverse link function to get the actual response values. This option is not applicable for multinomial distributions.
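To make the "linearized space" wording concrete, the following base R sketch shows the arithmetic under the description above. The vectors `offset` and `f_x` are hypothetical numbers, with `f_x` standing in for whatever correction the trees learn. This is not H2O's internal code.

```r
# Hypothetical per-row offsets and learned corrections (illustrative values)
offset <- c(-2.0, 0.0, 3.0)
f_x    <- c(0.5, -0.5, 0.25)

# Gaussian (identity link): the offset is added directly to the prediction
pred_gaussian <- offset + f_x

# Bernoulli (logit link): add on the link scale, then apply the inverse link
pred_bernoulli <- plogis(offset + f_x)

print(pred_gaussian)
print(pred_bernoulli)
```

Because the offset enters before the inverse link, the Bernoulli predictions stay strictly between 0 and 1 however large or small the offset is.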

And here is an example of how to use this parameter on a toy dataset:

(example with Bernoulli distribution)

library(h2o)
h2o.init()

# import the cars dataset:
# this dataset is used to classify whether or not a car is economical based on
# the car's displacement, power, weight, and acceleration, and the year it was made
cars <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")

# convert response column to a factor
cars["economy_20mpg"] <- as.factor(cars["economy_20mpg"])

# create a new offset column filled with a constant value of 0.5
cars["offset"] <- as.h2o(rep(.5, dim(cars)[1]))

# set the predictor names and the response column name
predictors <- c("displacement","power","weight","acceleration","year")
response <- "economy_20mpg"

# split into train and validation sets
cars.split <- h2o.splitFrame(data = cars,ratios = 0.8, seed = 1234)
train <- cars.split[[1]]
valid <- cars.split[[2]]

# try using the `offset_column` parameter:
# train your model with a training_frame and a validation_frame
cars_gbm <- h2o.gbm(x = predictors, y = response, training_frame = train, offset_column = "offset",
                  validation_frame = valid, seed = 1234)

# print the auc for your model
print(h2o.auc(cars_gbm, valid = TRUE))

Gaussian example (where it makes more sense to use this option)

library(h2o)
h2o.init()

# import the boston dataset:
# this dataset looks at features of the boston suburbs and predicts median housing prices
# the original dataset can be found at https://archive.ics.uci.edu/ml/datasets/Housing
boston <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")

# set the predictor names and the response column name
predictors <- colnames(boston)[1:13]
# set the response column to "medv", the median value of owner-occupied homes in $1000's
response <- "medv"

# convert the chas column to a factor (chas = Charles River dummy variable (= 1 if tract bounds river; 0 otherwise))
boston["chas"] <- as.factor(boston["chas"])

# create a new offset column by taking the log of the response column
boston["offset"] <- log(boston["medv"])

# split into train and validation sets
boston.splits <- h2o.splitFrame(data =  boston, ratios = .8, seed = 1234)
train <- boston.splits[[1]]
valid <- boston.splits[[2]]

# try using the `offset_column` parameter:
# train your model, where you specify the offset_column
boston_gbm <- h2o.gbm(x = predictors, y = response, training_frame = train,
               validation_frame = valid,
               offset_column = "offset",
               seed = 1234)

# print the mse for validation set
print(h2o.mse(boston_gbm, valid = TRUE))
