There are two ways:
- You can create and specify the folds manually.
- You can ask H2O to save the fold indexes (for each row, which fold ID does it belong to?) and return them as a single column of data, by setting keep_cross_validation_fold_assignment=True.
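To make "fold indexes" concrete, here is a minimal plain-Python sketch of what such a column contains: one integer fold ID per row. The helper name assign_folds is hypothetical, and this is only an illustration of the data's shape, not H2O's actual assignment algorithm.

```python
import random

def assign_folds(n_rows, n_folds, seed=None):
    """Return one fold ID in the range 0..n_folds-1 for each row (hypothetical helper)."""
    rng = random.Random(seed)  # seeded generator, so the assignment is reproducible
    return [rng.randrange(n_folds) for _ in range(n_rows)]

# Ten rows split across five folds; the list plays the role of the fold ID column
fold_ids = assign_folds(n_rows=10, n_folds=5, seed=1)
print(fold_ids)
```

Reusing the same seed reproduces the same assignment, which is the same reason the H2O examples in this answer pass seed=1.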
Here are some code examples:
import h2o
from h2o.estimators import H2OGradientBoostingEstimator
h2o.init()
# Import cars dataset
cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
x = ["displacement","power","weight","acceleration","year"]
y = "economy_20mpg"
nfolds = 5
First way:
# Create a k-fold column and append to the cars dataset
# Or you can use an existing fold id column
cars["fold_id"] = cars.kfold_column(n_folds=nfolds, seed=1)
# Train a GBM
cars_gbm = H2OGradientBoostingEstimator(seed=1, fold_column="fold_id",
                                        keep_cross_validation_fold_assignment=True)
cars_gbm.train(x=x, y=y, training_frame=cars)
# View the fold ids (identical to cars["fold_id"])
print(cars_gbm.cross_validation_fold_assignment())
Second way:
# Train a GBM & save fold IDs
cars_gbm = H2OGradientBoostingEstimator(seed=1, nfolds=nfolds,
                                        keep_cross_validation_fold_assignment=True)
cars_gbm.train(x=x, y=y, training_frame=cars)
# View the fold ids
print(cars_gbm.cross_validation_fold_assignment())
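Once you have the fold IDs, a typical next step is to check how many rows landed in each fold, or to recover the training and holdout rows for a given fold. The sketch below does this in plain Python on a made-up list of fold IDs; it assumes you have already converted the single-column frame into a Python list, and split_by_fold is a hypothetical helper, not part of H2O.

```python
from collections import Counter

def split_by_fold(fold_ids, holdout_fold):
    """Return (train_rows, holdout_rows) as lists of row indices (hypothetical helper)."""
    train = [i for i, f in enumerate(fold_ids) if f != holdout_fold]
    holdout = [i for i, f in enumerate(fold_ids) if f == holdout_fold]
    return train, holdout

# Made-up fold IDs for eight rows, standing in for the saved fold assignment column
fold_ids = [0, 1, 2, 0, 1, 2, 0, 1]

# Number of rows assigned to each fold
rows_per_fold = Counter(fold_ids)
print(rows_per_fold)

# Rows used for training vs. held out when fold 0 is the validation fold
train_rows, holdout_rows = split_by_fold(fold_ids, holdout_fold=0)
print(train_rows, holdout_rows)
```

The same row indices can then be used to line up out-of-fold predictions with the original rows.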