2 votes

I am working on a regression model using XGBoost trying to predict dollars spent by customers in a year. I have ~6,000 samples (customers), ~200 features related to those customers, and the amount they spent in a year (my outcome variable). I have split my data into a 75% / 25% train / test split and have run a few XGBoost models with varying degrees of success…

There appears to be some overfit in my initial model with no tuning (default parameters), which had the following R2 values:
• Training R2 – 0.593
• Test R2 – 0.098

I then ran a grid search of the following hyperparameters, which did not improve the model significantly.

import xgboost as xgb
from sklearn.model_selection import GridSearchCV

param_grid = {'learning_rate': [0.05, 0.10, 0.20],
              'min_child_weight': [1, 5, 10],
              'gamma': [0.5, 1, 5],
              'subsample': [0.6, 0.8, 1.0],
              'colsample_bytree': [0.6, 0.8, 1.0],
              'max_depth': [3, 4, 5]}

grid = GridSearchCV(xgb.XGBRegressor(verbosity=0),  # 'silent' is deprecated in recent XGBoost
                    param_grid,
                    n_jobs=1,
                    cv=3,
                    scoring='r2',
                    verbose=1,
                    refit=True)

• Training R2 – 0.418
• Test R2 – 0.093

I also manually tuned the hyperparameters and was able to get the following results, but that's about it.
• Training R2 – 0.573
• Test R2 – 0.148

These 6,000 customers represent all of the customers for the year, so I can't bring in additional samples to improve sample size.

My Question: Are there suggestions for other hyperparameters to tune or strategies I should try to make the model more consistent across train / test splits and reduce overfit? It's possible that there is too much variance in my outcome variable (dollars spent) to create a consistent model, but I want to try to exhaust all options.


2 Answers

1 vote

There is a simple rule in machine learning: your model can do wonders if your data has some signal, and if it doesn't, no amount of tuning will create one.

But I am still willing to answer your question, and if there is some signal, you can definitely improve on your R squared value.

Firstly, try to reduce your features. 200 is a lot of features for 4,500 rows of training data. Try different feature counts, such as 20, 50, 80, or 100. You can use scikit-learn's SelectKBest, or calculate the effect size of each feature, to select the best K features.
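As a sketch of the SelectKBest approach (the data here is a synthetic stand-in, since the real customer table isn't available):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Toy stand-in for the real data: 4,500 training rows, 200 features
X_train, y_train = make_regression(n_samples=4500, n_features=200,
                                   n_informative=30, noise=10.0,
                                   random_state=0)

# Keep the 50 features with the strongest univariate F-statistic
selector = SelectKBest(score_func=f_regression, k=50)
X_reduced = selector.fit_transform(X_train, y_train)
print(X_reduced.shape)  # (4500, 50)
```

You can then re-run the same XGBoost model on the reduced matrix and compare test R2 across several values of k.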

Secondly, the problem might be in your test data: it may represent a very different subset of customers than your training data. Use cross-validation, so that the R squared value you report is averaged over several different subsets of the data rather than depending on one split.

Thirdly, instead of XGBoost, try simpler regression methods such as Linear, Lasso, Ridge, or Elastic Net regression, and see whether they do better.
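For example, the linear baselines can be compared with the same cross-validated R2 (synthetic data again; the alpha values are arbitrary starting points, not tuned):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: many features, few of them informative
X, y = make_regression(n_samples=1000, n_features=200, n_informative=20,
                       noise=25.0, random_state=0)

for model in (LinearRegression(), Ridge(alpha=1.0),
              Lasso(alpha=0.1, max_iter=10000),
              ElasticNet(alpha=0.1, max_iter=10000)):
    r2 = cross_val_score(model, X, y, cv=5, scoring='r2').mean()
    print(type(model).__name__, round(r2, 3))
```

With 200 features and only a few thousand rows, the regularized models (Lasso, Ridge, Elastic Net) are often more stable than a boosted ensemble.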

0 votes

The results are quite low, but that is not a matter of hyperparameter tuning. My recommendations are:

  1. Analyze the correlation between the features and money spent. To decide which features to keep, you can calculate feature importances, build a correlation matrix, etc. Sometimes I change the feature list manually based on an assumption and review how it affects the score. Make sure you understand the influence of each feature; useless features should be removed.
  2. A low R^2 is often the outcome of uncleaned data. Check for outliers. Don't replace NaNs with 0 all the time; sometimes it's better to remove the row. If you grab the data from a third party, there can be errors as well.
  3. Review the predictions of the test data set in detail. Go case by case to understand why the model fails... Usually, that helps to find the source of the issue.
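A small sketch of points 1 and 2 above (a correlation check plus a simple outlier flag; the data frame and column names here are synthetic, since the real table isn't shown):

```python
import pandas as pd
from sklearn.datasets import make_regression

# Toy stand-in for the customer table
X, y = make_regression(n_samples=500, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)
df = pd.DataFrame(X, columns=[f'f{i}' for i in range(10)])
df['spend'] = y

# Correlation of each feature with the target, strongest first
corr = df.corr()['spend'].drop('spend').abs().sort_values(ascending=False)
print(corr.head())

# Simple outlier check on the target: flag rows beyond 3 standard deviations
z = (df['spend'] - df['spend'].mean()) / df['spend'].std()
outliers = df[z.abs() > 3]
print(len(outliers), 'potential outliers')
```

Features at the bottom of the correlation ranking are candidates for removal; flagged rows are worth inspecting case by case before deciding whether to drop them.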

PS In my experience, hyperparameter tuning can gain up to 3% accuracy, but that won't fix your model.