2 votes

I am working on a regression model using XGBoost trying to predict dollars spent by customers in a year. I have ~6,000 samples (customers), ~200 features related to those customers, and the amount they spent in a year (my outcome variable). I have split my data into a 75% / 25% train / test split and have run a few XGBoost models with varying degrees of success…

There appears to be some overfit in my initial model with no tuning (default parameters), which had the following R2 values:
• Training R2 – 0.593
• Test R2 – 0.098

I then ran a grid search of the following hyperparameters, which did not improve the model significantly.

import xgboost as xgb
from sklearn.model_selection import GridSearchCV

param_grid = {'learning_rate': [0.05, 0.10, 0.20],
              'min_child_weight': [1, 5, 10],
              'gamma': [0.5, 1, 5],
              'subsample': [0.6, 0.8, 1.0],
              'colsample_bytree': [0.6, 0.8, 1.0],
              'max_depth': [3, 4, 5]}

grid = GridSearchCV(xgb.XGBRegressor(verbosity=0),  # 'silent' is deprecated in recent XGBoost
                    param_grid,
                    n_jobs=1,
                    cv=3,
                    scoring='r2',
                    verbose=1,
                    refit=True)

• Training R2 – 0.418
• Test R2 – 0.093

I also manually tuned the hyperparameters and was able to get the following results, but that's about it.
• Training R2 – 0.573
• Test R2 – 0.148

These 6,000 customers represent all of the customers for the year, so I can't bring in additional samples to improve sample size.

My Question: Are there suggestions for other hyperparameters to tune or strategies I should try to make the model more consistent across train / test splits and reduce overfit? It's possible that there is too much variance in my outcome variable (dollars spent) to create a consistent model, but I want to try to exhaust all options.


2 Answers

1 vote

There is a simple rule in machine learning: your model can do wonders if your data has some signal, and if it doesn't, no amount of tuning will create one.

But I am still willing to answer your question, and if there is some signal, you can definitely improve on your R squared value.

Firstly, try to reduce your features. 200 is a lot of features for 4,500 rows of training data. Try different feature counts, such as 20, 50, 80, or 100. You can use scikit-learn's SelectKBest, or calculate the effect size of each feature, to select the best K features.
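As a sketch of the SelectKBest approach (the data here is a synthetic stand-in, since the real customer table isn't available):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Toy stand-in for the real data: 4,500 training rows, 200 features
X_train, y_train = make_regression(n_samples=4500, n_features=200,
                                   n_informative=30, noise=10.0,
                                   random_state=0)

# Keep the 50 features with the strongest univariate F-statistic
selector = SelectKBest(score_func=f_regression, k=50)
X_reduced = selector.fit_transform(X_train, y_train)
print(X_reduced.shape)  # (4500, 50)
```

You can then re-run the same XGBoost model on the reduced matrix and compare test R2 across several values of k.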

Secondly, the problem might be in your test data: it may represent a very different subset of customers than your training data. Use cross-validation, so that the R squared value you report is averaged over several different subsets of the data rather than depending on one split.

Thirdly, instead of XGBoost, try simpler regression methods such as Linear, Lasso, Ridge, or Elastic Net regression, and see whether they do better.
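For example, the linear baselines can be compared with the same cross-validated R2 (synthetic data again; the alpha values are arbitrary starting points, not tuned):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: many features, few of them informative
X, y = make_regression(n_samples=1000, n_features=200, n_informative=20,
                       noise=25.0, random_state=0)

for model in (LinearRegression(), Ridge(alpha=1.0),
              Lasso(alpha=0.1, max_iter=10000),
              ElasticNet(alpha=0.1, max_iter=10000)):
    r2 = cross_val_score(model, X, y, cv=5, scoring='r2').mean()
    print(type(model).__name__, round(r2, 3))
```

With 200 features and only a few thousand rows, the regularized models (Lasso, Ridge, Elastic Net) are often more stable than a boosted ensemble.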

0 votes

The results are quite low, but that is not a matter of hyperparameter tuning. My recommendations are:

  1. Analyze the correlation between the features and money spent. To decide which features to keep, you can calculate feature importances, build a correlation matrix, etc. Sometimes I change the feature list manually based on an assumption and review how it affects the score. Make sure you understand the influence of each feature; useless features should be removed.
  2. A low R^2 is often the outcome of uncleaned data. Check for outliers. Don't replace NaNs with 0 all the time; sometimes it's better to remove the row. If you grab the data from a third party, there can be errors as well.
  3. Review the predictions of the test data set in detail. Go case by case to understand why the model fails... Usually, that helps to find the source of the issue.
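A small sketch of points 1 and 2 above (a correlation check plus a simple outlier flag; the data frame and column names here are synthetic, since the real table isn't shown):

```python
import pandas as pd
from sklearn.datasets import make_regression

# Toy stand-in for the customer table
X, y = make_regression(n_samples=500, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)
df = pd.DataFrame(X, columns=[f'f{i}' for i in range(10)])
df['spend'] = y

# Correlation of each feature with the target, strongest first
corr = df.corr()['spend'].drop('spend').abs().sort_values(ascending=False)
print(corr.head())

# Simple outlier check on the target: flag rows beyond 3 standard deviations
z = (df['spend'] - df['spend'].mean()) / df['spend'].std()
outliers = df[z.abs() > 3]
print(len(outliers), 'potential outliers')
```

Features at the bottom of the correlation ranking are candidates for removal; flagged rows are worth inspecting case by case before deciding whether to drop them.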

PS In my experience, hyperparameter tuning can gain up to 3% accuracy, but that won't fix your model.