I am using an H2O GBM model and want to compare it with other machine learning methods. I run it with 5-fold cross-validation. As far as I know, the data is split into 5 folds, and in each round one fold is held out for testing while the others are used for training. How can I get the 5 folds of data from H2O's GBM?

I am using Python.

from h2o.estimators import H2OGradientBoostingEstimator

folds = 5
cars_gbm = H2OGradientBoostingEstimator(nfolds=folds, seed=1234)

1 Answer

There are two ways:

  1. You can create and specify the folds manually.
  2. You can ask H2O to save the fold assignment (the fold ID that each row belongs to) and return it as a single-column frame, by setting keep_cross_validation_fold_assignment=True.
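
To make the idea of a fold ID column concrete, here is a minimal plain-Python sketch of a per-row fold assignment. The helper make_fold_ids is hypothetical and is not how H2O assigns folds internally; it only shows the shape of the result: one integer between 0 and nfolds - 1 for every row.

```python
import random
from collections import Counter

def make_fold_ids(n_rows, n_folds, seed=1):
    """Assign each row a fold ID in 0..n_folds-1 (hypothetical helper, not H2O's code)."""
    rng = random.Random(seed)
    # Cycle through the fold IDs so fold sizes differ by at most one row,
    # then shuffle so the assignment does not follow row order.
    fold_ids = [i % n_folds for i in range(n_rows)]
    rng.shuffle(fold_ids)
    return fold_ids

fold_ids = make_fold_ids(n_rows=12, n_folds=5)
print(fold_ids)           # one fold ID per row
print(Counter(fold_ids))  # number of rows in each fold
```

Rows that share a fold ID form the held-out set for that round of cross-validation; all other rows are used for training.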

Here are some code examples:

import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()

# Import cars dataset
cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
x = ["displacement","power","weight","acceleration","year"]
y = "economy_20mpg"
nfolds = 5

First way:

# Create a k-fold column and append to the cars dataset
# Or you can use an existing fold id column
cars["fold_id"] = cars.kfold_column(n_folds=nfolds, seed=1)

# Train a GBM
cars_gbm = H2OGradientBoostingEstimator(seed=1, fold_column="fold_id",
                                        keep_cross_validation_fold_assignment=True)
cars_gbm.train(x=x, y=y, training_frame=cars)

# View the fold ids (identical to cars["fold_id"])
print(cars_gbm.cross_validation_fold_assignment())

Second way:

# Train a GBM & save fold IDs
cars_gbm = H2OGradientBoostingEstimator(seed=1, nfolds=nfolds,
                                        keep_cross_validation_fold_assignment=True)
cars_gbm.train(x=x, y=y, training_frame=cars)

# View the fold ids
print(cars_gbm.cross_validation_fold_assignment())
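
Once you have the fold IDs, you can reuse the same splits when evaluating other models, so the comparison is made on identical folds. The sketch below is plain Python and assumes the fold IDs have already been pulled into a list (for example, by converting the frame above with as_data_frame(), which I would expect to work but have not shown here). The function name splits_from_fold_ids and the example IDs are made up for illustration.

```python
def splits_from_fold_ids(fold_ids):
    """Yield (fold, train_rows, test_rows) for each distinct fold ID."""
    for fold in sorted(set(fold_ids)):
        # Rows assigned to this fold are held out for testing...
        test_rows = [i for i, f in enumerate(fold_ids) if f == fold]
        # ...and every other row is used for training.
        train_rows = [i for i, f in enumerate(fold_ids) if f != fold]
        yield fold, train_rows, test_rows

# Illustrative fold IDs for 6 rows (assumed values, not taken from the cars data)
example_ids = [0, 1, 2, 0, 1, 2]
for fold, train_rows, test_rows in splits_from_fold_ids(example_ids):
    print(fold, train_rows, test_rows)
# 0 [1, 2, 4, 5] [0, 3]
# 1 [0, 2, 3, 5] [1, 4]
# 2 [0, 1, 3, 4] [2, 5]
```

The row indices can then be used to slice the same dataset for any other library, so each method is trained and tested on exactly the rows the GBM used.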