I am working on a regression model using XGBoost trying to predict dollars spent by customers in a year. I have ~6,000 samples (customers), ~200 features related to those customers, and the amount they spent in a year (my outcome variable). I have split my data into a 75% / 25% train / test split and have run a few XGBoost models with varying degrees of success…
There appears to be some overfitting in my initial model with no tuning (default parameters), which produced the following R2 values:
• Training R2 – 0.593
• Test R2 – 0.098
I then ran a grid search over the following hyperparameters, which did not improve the model significantly.
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

param_grid = {
    'learning_rate': [0.05, 0.10, 0.20],
    'min_child_weight': [1, 5, 10],
    'gamma': [0.5, 1, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'max_depth': [3, 4, 5],
}
grid = GridSearchCV(
    xgb.XGBRegressor(verbosity=0),  # `silent` is deprecated in recent xgboost
    param_grid,
    n_jobs=1,
    cv=3,
    scoring='r2',
    verbose=1,
    refit=True,
)
• Training R2 – 0.418
• Test R2 – 0.093
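For scale, this grid is already a fairly large search; assuming cv=3 as above, the number of fits works out to:

```python
import math

param_grid = {
    'learning_rate': [0.05, 0.10, 0.20],
    'min_child_weight': [1, 5, 10],
    'gamma': [0.5, 1, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'max_depth': [3, 4, 5],
}

# Each hyperparameter has 3 candidate values, so the grid is 3^6 settings.
n_candidates = math.prod(len(v) for v in param_grid.values())
print(n_candidates)      # 729 candidate settings
print(n_candidates * 3)  # 2187 model fits with cv=3
```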
I also manually tuned the hyperparameters and was able to get the following results, but that's about as far as I could push it.
• Training R2 – 0.573
• Test R2 – 0.148
These 6,000 customers represent all of the customers for the year, so I can't bring in additional samples to improve sample size.
My Question: Are there suggestions for other hyperparameters to tune or strategies I should try to make the model more consistent across train / test splits and reduce overfitting? It's possible that there is too much variance in my outcome variable (dollars spent) to create a consistent model, but I want to try to exhaust all options.