In my dataset the response variable is extremely left-skewed. I have tried fitting models with h2o.randomForest() and h2o.gbm(), as shown below. In both cases I can tune min_split_improvement and min_rows to avoid overfitting, but I still see very high errors on the tail observations. I have also tried using weights_column to oversample the tail observations and undersample the rest, but it does not help.
# Note: the validation frame must be passed as validation_frame, and
# balance_classes applies only to classification, so it is dropped here
# for a numeric response.
h2o.model <- h2o.gbm(x = predictors, y = response, training_frame = train,
                     validation_frame = valid, seed = 1, ntrees = 150,
                     max_depth = 10, min_rows = 2, model_id = "GBM_DD",
                     nbins = 20, stopping_metric = "MSE", stopping_rounds = 10,
                     min_split_improvement = 0.0005)

h2o.model <- h2o.randomForest(x = predictors, y = response, training_frame = train,
                              validation_frame = valid, seed = 1, ntrees = 150,
                              max_depth = 10, min_rows = 2, model_id = "DRF_DD",
                              nbins = 20, stopping_metric = "MSE", stopping_rounds = 10,
                              min_split_improvement = 0.0005)
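For reference, one way the weights_column mentioned above might be constructed is to upweight the tail rows before converting to an H2OFrame. This is only a hypothetical sketch: the 0.9 quantile cutoff, the weight of 5, and the names train_df and w are assumptions, not part of the original code.

```r
# Hypothetical sketch: upweight right-tail observations.
# The 0.9 quantile cutoff and the weight of 5 are arbitrary assumptions.
y <- train_df$response                    # numeric response vector
cutoff <- quantile(y, 0.9)
train_df$w <- ifelse(y >= cutoff, 5, 1)   # tail rows count 5x in the loss
# train <- as.h2o(train_df)
# h2o.gbm(..., training_frame = train, weights_column = "w")
```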
I have also tried the h2o.automl() function from the h2o package for better performance, but there I see significant overfitting, and I don't know of any parameters in h2o.automl() to control it. Does anyone know of a way to avoid overfitting with h2o.automl()?
EDIT
Following Erin's suggestion, the distribution of the log-transformed response is given below.
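A common way to use such a transform (a sketch, not the original code: the names train_df, fit, and test are assumptions, and log1p requires a non-negative response) is to fit on the log scale and back-transform the predictions:

```r
# Sketch: fit on the log scale, predict, then back-transform.
# log1p/expm1 are used so that zeros in the response are handled safely.
train_df$log_resp <- log1p(train_df$response)
# train <- as.h2o(train_df)
# fit   <- h2o.gbm(x = predictors, y = "log_resp", training_frame = train)
# preds <- expm1(as.vector(h2o.predict(fit, test)))  # back to original scale
```

Note that back-transforming a mean prediction this way tends to underestimate the conditional mean on the original scale (Jensen's inequality).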
EDIT2: The distribution of the original response is given below.
A comment from missuse suggests caret's BoxCoxTrans, which can help with skewness.
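A minimal sketch of that suggestion, assuming the caret package is installed and the response is strictly positive (train_df is an assumed name):

```r
library(caret)
# BoxCoxTrans estimates the Box-Cox lambda from the data (requires y > 0).
bc <- BoxCoxTrans(train_df$response)
train_df$bc_resp <- predict(bc, train_df$response)  # transformed response
# Fit the H2O model on "bc_resp" instead of the raw response.
```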