5 votes

In my problem dataset, the response variable is extremely skewed to the left. I have tried to fit the model with h2o.randomForest() and h2o.gbm() as below. I can tune min_split_improvement and min_rows to avoid overfitting in these two cases, but with these models I see very high errors on the tail observations. I have tried using weights_column to oversample the tail observations and undersample the others (roughly as sketched below), but it does not help.
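For reference, the weighting scheme looked roughly like this; the 0.9 quantile cutoff and the weight values here are illustrative, not the exact numbers used:

# Upweight the right-tail observations and downweight the rest
# (cutoff and weights are arbitrary examples)
cutoff <- as.numeric(h2o.quantile(train[, response], probs = 0.9))
train[, "obs_weight"] <- h2o.ifelse(train[, response] > cutoff, 5, 0.5)

h2o.model <- h2o.gbm(x = predictors, y = response, training_frame = train,
                     validation_frame = valid, weights_column = "obs_weight",
                     ntrees = 150, max_depth = 10, seed = 1)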

h2o.model <- h2o.gbm(x = predictors, y = response, training_frame = train,
                     validation_frame = valid, seed = 1, ntrees = 150,
                     max_depth = 10, min_rows = 2, model_id = "GBM_DD",
                     balance_classes = TRUE, nbins = 20, stopping_metric = "MSE",
                     stopping_rounds = 10, min_split_improvement = 0.0005)


h2o.model <- h2o.randomForest(x = predictors, y = response, training_frame = train,
                              validation_frame = valid, seed = 1, ntrees = 150,
                              max_depth = 10, min_rows = 2, model_id = "DRF_DD",
                              balance_classes = TRUE, nbins = 20, stopping_metric = "MSE",
                              stopping_rounds = 10, min_split_improvement = 0.0005)

I have tried the h2o.automl() function of the h2o package on this problem, hoping for better performance. However, I see significant overfitting, and I don't know of any parameters in h2o.automl() to control it.

Does anyone know of a way to avoid overfitting with h2o.automl()?

EDIT

Following Erin's suggestion, the distribution of the log-transformed response is shown below.

[histogram of the log-transformed response]

EDIT2: Distribution of the original response.

[histogram of the original response]

Perhaps try to transform the problematic features; caret has a useful function, BoxCoxTrans, that can help with skewness (a sketch follows after these comments). – missuse
It looks like a Poisson distribution, so I would either use a linear model where I specify the distribution, or I would try boosting, which will handle this. Here is what boosting does: i2.wp.com/freakonometrics.hypotheses.org/files/2015/07/… – Esben Eickhardt
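For reference, a minimal sketch of the BoxCoxTrans approach suggested in the first comment, assuming the response is first pulled into a plain R vector and is strictly positive (names here are illustrative):

library(caret)

y  <- as.data.frame(train[, response])[[1]]   # pull the H2O column into an R vector
bc <- BoxCoxTrans(y)                          # estimate the Box-Cox lambda
y_transformed <- predict(bc, y)               # apply the transform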

2 Answers

13 votes

H2O AutoML uses H2O algos (e.g. RF, GBM) underneath, so if you're not able to get good models there, you will suffer from the same issues using AutoML. I am not sure that I would call this overfitting -- it's more that your models are not doing well at predicting outliers.

My recommendation is to log your response variable -- that's a useful thing to do when you have a skewed response. In the future, H2O AutoML will try to detect a skewed response automatically and take the log, but that's not a feature of the current version (H2O 3.16.*).

Here's a bit more detail if you are not familiar with this process. First, create a new column, e.g. log_response, as follows and use that as the response when training (in RF, GBM or AutoML):

train[,"log_response"] <- h2o.log(train[,response])

Caveats: If you have zeros in your response, use h2o.log1p() instead (a sketch follows below). Make sure not to include the original response in your predictors. In your case, you don't need to change anything, because you are already specifying the predictors explicitly via a predictors vector.
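A minimal sketch of the zero-safe variant (model and test stand in for a trained model and a test frame):

train[, "log_response"] <- h2o.log1p(train[, response])   # log(1 + y) handles zeros

# ... fit a model with y = "log_response" as shown below ...

log_pred <- h2o.predict(model, test)
pred <- h2o.exp(log_pred) - 1   # invert log1p: exp(x) - 1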

Keep in mind that when you log the response, your predictions and model metrics will be on the log scale, so you may need to convert your predictions back to the original scale, like this:

model <- h2o.randomForest(x = predictors, y = "log_response",
                          training_frame = train, validation_frame = valid)
log_pred <- h2o.predict(model, test)
pred <- h2o.exp(log_pred)   # back-transform to the original scale

This gives you the predictions, but if you also want to see the metrics, you will have to compute them with the h2o.make_metrics() function using the new predictions, rather than extracting the metrics from the model.

# Compute metrics on the back-transformed (original-scale) predictions
perf <- h2o.make_metrics(predicted = pred, actual = test[, response])
h2o.mse(perf)

You can try this using RF as I showed above, with a GBM, or with AutoML (which should give better performance than a single RF or GBM).
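For example, a minimal AutoML sketch on the logged response; max_runtime_secs and the choice of leaderboard frame are illustrative:

aml <- h2o.automl(x = predictors, y = "log_response",
                  training_frame = train,
                  leaderboard_frame = valid,   # rank models on the validation frame
                  max_runtime_secs = 600,      # illustrative time budget
                  seed = 1)
log_pred <- h2o.predict(aml@leader, test)
pred <- h2o.exp(log_pred)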

Hopefully that helps improve the performance of your models!

0 votes

When your target variable is skewed, MSE is not a good metric to use. I would try changing the loss function, because GBM fits the model to the gradient of the loss function, and you want to make sure you are using the correct distribution. If you have a spike at zero and a right-skewed positive target, Tweedie would probably be a better option.
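A minimal sketch of what that looks like with h2o.gbm(); the tweedie_power value is illustrative and worth tuning (values between 1 and 2 move between Poisson-like and gamma-like behavior):

# Tweedie loss for a zero-inflated, right-skewed positive response
h2o.model <- h2o.gbm(x = predictors, y = response,
                     training_frame = train, validation_frame = valid,
                     distribution = "tweedie",
                     tweedie_power = 1.5,   # illustrative; tune between 1 and 2
                     ntrees = 150, seed = 1)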