updated
For older versions of H2O (3.10.2 or earlier), the values in H2O GBM's offset_column have to be less than 1 when you use a Bernoulli distribution. Newer versions let you pass in any value. In your case, with a Bernoulli distribution, one way to create the offset column is to use the predicted logit values from a previous model (as you said you wanted to do in the comments).
This is how the gbm offset column works:
An offset is a per-row “bias value” that is used during model training. For Gaussian distributions, offsets can be seen as simple corrections to the response (y) column. Instead of learning to predict the response (y-row), the model learns to predict the (row) offset of the response column. For other distributions, the offset corrections are applied in the linearized space before applying the inverse link function to get the actual response values. This option is not applicable for multinomial distributions.
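To make the description above concrete, here is how the offset enters the prediction for the two distributions used in this answer. The notation is my own, not H2O's: $o_i$ is the offset for row $i$ and $f(x_i)$ is the summed output of the trees.

```latex
% Gaussian (identity link): the offset is added directly to the tree output,
% so the trees effectively fit the residual y_i - o_i.
\hat{y}_i = o_i + f(x_i)

% Bernoulli (logit link): the offset is added on the logit scale,
% then the inverse link (the logistic function) maps it to a probability.
\hat{p}_i = \frac{1}{1 + \exp\bigl(-(o_i + f(x_i))\bigr)}
```

This is why, for a Bernoulli model, an offset taken from a previous model should be on the logit scale rather than a raw probability.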
And here is an example of how to use this parameter on a toy dataset.
Example with a Bernoulli distribution:
library(h2o)
h2o.init()
# import the cars dataset:
# this dataset is used to classify whether or not a car is economical based on
# the car's displacement, power, weight, and acceleration, and the year it was made
cars <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
# convert response column to a factor
cars["economy_20mpg"] <- as.factor(cars["economy_20mpg"])
# create a new offset column with a constant value of 0.5 for every row
cars["offset"] <- as.h2o(rep(.5, dim(cars)[1]))
# set the predictor names and the response column name
predictors <- c("displacement","power","weight","acceleration","year")
response <- "economy_20mpg"
# split into train and validation sets
cars.split <- h2o.splitFrame(data = cars,ratios = 0.8, seed = 1234)
train <- cars.split[[1]]
valid <- cars.split[[2]]
# try using the `offset_column` parameter:
# train your model, where you specify the offset_column
cars_gbm <- h2o.gbm(x = predictors, y = response, training_frame = train, offset_column = "offset",
validation_frame = valid, seed = 1234)
# print the auc for your model
print(h2o.auc(cars_gbm, valid = TRUE))
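The example above uses a constant offset only to show the mechanics. If you want the offset to come from a previous model's predictions, you would convert its predicted probabilities to logits first. Below is a minimal base-R sketch of that conversion. The probabilities in `p_prev` are made-up values for illustration; with H2O I would expect them to come from the `p1` column of `h2o.predict()` on the earlier model, but treat that as an assumption to check against your version.

```r
# hypothetical predicted probabilities from an earlier model (illustration only)
p_prev <- c(0.10, 0.50, 0.90)

# convert probabilities to logits: log(p / (1 - p))
offset <- qlogis(p_prev)

# sanity check: applying the inverse link to the offset alone
# (i.e. a tree contribution of 0) recovers the original probabilities
p_back <- plogis(offset + 0)

print(offset)
print(p_back)
```

Note that logits are not bounded, so on the older H2O versions mentioned at the top the restriction on offset values still applies.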
Gaussian example (where it makes more sense to use this option)
library(h2o)
h2o.init()
# import the boston dataset:
# this dataset looks at features of the boston suburbs and predicts median housing prices
# the original dataset can be found at https://archive.ics.uci.edu/ml/datasets/Housing
boston <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
# set the predictor names and the response column name
predictors <- colnames(boston)[1:13]
# set the response column to "medv", the median value of owner-occupied homes in $1000's
response <- "medv"
# convert the chas column to a factor (chas = Charles River dummy variable (= 1 if tract bounds river; 0 otherwise))
boston["chas"] <- as.factor(boston["chas"])
# create a new offset column by taking the log of the response column
boston["offset"] <- log(boston["medv"])
# split into train and validation sets
boston.splits <- h2o.splitFrame(data = boston, ratios = .8, seed = 1234)
train <- boston.splits[[1]]
valid <- boston.splits[[2]]
# try using the `offset_column` parameter:
# train your model, where you specify the offset_column
boston_gbm <- h2o.gbm(x = predictors, y = response, training_frame = train,
validation_frame = valid,
offset_column = "offset",
seed = 1234)
# print the mse for validation set
print(h2o.mse(boston_gbm, valid = TRUE))