I'm trying to use XGBoost on a dataset with around 500,000 observations and 10 features. When I do hyperparameter tuning with RandomizedSearchCV, the best parameters it finds perform worse than the model with the default parameters.
from xgboost import XGBRegressor
from sklearn.metrics import r2_score

model = XGBRegressor()
model.fit(X_train, y_train["speed"])
y_predict_speed = model.predict(X_test)
print("R2 score:", r2_score(y_test["speed"], y_predict_speed, multioutput='variance_weighted'))
R2 score: 0.3540656307310167
Randomized Search
## Hyper Parameter Optimization
n_estimators = [100, 500, 900, 1100, 1500]
max_depth = [2, 3, 5, 10, 15]
booster = ['gbtree', 'gblinear']
learning_rate = [0.05, 0.1, 0.15, 0.20]
min_child_weight = [1, 2, 3, 4]
base_score = [0.25, 0.5, 0.75, 1]
# Define the grid of hyperparameters to search
hyperparameter_grid = {
    'n_estimators': n_estimators,
    'max_depth': max_depth,
    'learning_rate': learning_rate,
    'min_child_weight': min_child_weight,
    'booster': booster,
    'base_score': base_score,
}
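As an aside on the grid above: RandomizedSearchCV can also sample from continuous distributions instead of fixed lists, which lets the 50 iterations explore values between the grid points. A minimal sketch (the distribution ranges here are hypothetical, chosen to span the same values as the lists above):

```python
from sklearn.model_selection import ParameterSampler
from scipy.stats import randint, uniform

# Distributions instead of fixed lists; RandomizedSearchCV accepts any
# object with an rvs() method as a value in param_distributions.
param_distributions = {
    "n_estimators": randint(100, 1500),       # integers in [100, 1500)
    "max_depth": randint(2, 15),              # integers in [2, 15)
    "learning_rate": uniform(0.05, 0.15),     # floats in [0.05, 0.20]
    "min_child_weight": randint(1, 5),        # integers in [1, 5)
}

# Preview a few sampled candidates without fitting anything.
for params in ParameterSampler(param_distributions, n_iter=3, random_state=42):
    print(params)
```

ParameterSampler is the same sampler RandomizedSearchCV uses internally, so this is a cheap way to sanity-check what the search will actually try.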
# Set up the random search with 5-fold cross validation
from sklearn.model_selection import RandomizedSearchCV

regressor = XGBRegressor()
random_cv = RandomizedSearchCV(estimator=regressor,
                               param_distributions=hyperparameter_grid,
                               cv=5, n_iter=50,
                               scoring='neg_mean_absolute_error', n_jobs=4,
                               verbose=5,
                               return_train_score=True,
                               random_state=42)
random_cv.fit(X_train,y_train["speed"])
random_cv.best_estimator_
XGBRegressor(base_score=0.5, booster='gblinear', colsample_bylevel=None,
colsample_bynode=None, colsample_bytree=None, gamma=None,
gpu_id=-1, importance_type='gain', interaction_constraints=None,
learning_rate=0.15, max_delta_step=None, max_depth=15,
min_child_weight=3, missing=nan, monotone_constraints=None,
n_estimators=500, n_jobs=16, num_parallel_tree=None,
random_state=0, reg_alpha=0, reg_lambda=0, scale_pos_weight=1,
subsample=None, tree_method=None, validate_parameters=1,
verbosity=None)
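Note that the printed best estimator combines booster='gblinear' with tree-only parameters such as max_depth=15 and min_child_weight=3, which a linear booster has no use for (this is presumably also what the "parameters are not used" warning refers to). One way to keep tree-only parameters out of the gblinear candidates is to pass a list of dicts as param_distributions, which RandomizedSearchCV supports. A sketch of that shape, reusing the grid values from the question:

```python
from sklearn.model_selection import ParameterSampler

# gblinear fits a linear model, so depth-related parameters are ignored.
# A list of dicts keeps each booster paired only with parameters it uses.
param_distributions = [
    {   # tree booster: tree-specific parameters apply
        "booster": ["gbtree"],
        "max_depth": [2, 3, 5, 10, 15],
        "min_child_weight": [1, 2, 3, 4],
        "learning_rate": [0.05, 0.1, 0.15, 0.20],
        "n_estimators": [100, 500, 900, 1100, 1500],
    },
    {   # linear booster: only the parameters it actually uses
        "booster": ["gblinear"],
        "learning_rate": [0.05, 0.1, 0.15, 0.20],
        "n_estimators": [100, 500, 900, 1100, 1500],
    },
]

# Preview sampled candidates; gblinear ones never carry max_depth.
for params in ParameterSampler(param_distributions, n_iter=4, random_state=0):
    print(params)
```

The same list can be passed directly as param_distributions to RandomizedSearchCV in place of the single dict.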
Using the model found by the search:
regressor = XGBRegressor(base_score=0.5, booster='gblinear', colsample_bylevel=None,
colsample_bynode=None, colsample_bytree=None, gamma=None,
gpu_id=-1, importance_type='gain', interaction_constraints=None,
learning_rate=0.15, max_delta_step=None, max_depth=15,
min_child_weight=3, monotone_constraints=None,
n_estimators=500, n_jobs=16, num_parallel_tree=None,
random_state=0, reg_alpha=0, reg_lambda=0, scale_pos_weight=1,
subsample=None, tree_method=None, validate_parameters=1,
verbosity=None)
regressor.fit(X_train,y_train["speed"])
y_pred = regressor.predict(X_test)
from sklearn.metrics import r2_score
print("R2 score:", r2_score(y_test["speed"],y_pred, multioutput='variance_weighted'))
R2 score: 0.14258774171629718
As you can see, after 3 hours of running the randomized search the R2 score actually dropped. If I change the booster from 'gblinear' to 'gbtree', the score goes up to 0.65, so why isn't the randomized search finding that? I'm also getting the following warning:
This may not be accurate due to some parameters are only used in language bindings but passed down to XGBoost core. Or some parameters are not used but slip through this verification. Please open an issue if you find above cases.
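One mismatch worth checking in a setup like the above: the search optimizes 'neg_mean_absolute_error' while the final model is judged by R2, so the search can legitimately pick a model that is better on MAE but worse on R2. A minimal sketch of aligning the two metrics, on synthetic data and with scikit-learn's GradientBoostingRegressor standing in for XGBRegressor (the shape of the search is identical; the grid values here are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor  # stand-in for XGBRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=400, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

search = RandomizedSearchCV(
    estimator=GradientBoostingRegressor(random_state=42),
    param_distributions={
        "n_estimators": [50, 150],
        "max_depth": [2, 3, 5],
        "learning_rate": [0.05, 0.1, 0.2],
    },
    n_iter=5,
    cv=5,
    scoring="r2",  # same metric used to judge the final model
    random_state=42,
)
search.fit(X_train, y_train)

# best_estimator_ is already refit on the full training set (refit=True)
y_pred = search.best_estimator_.predict(X_test)
print("CV R2:", search.best_score_)
print("Test R2:", r2_score(y_test, y_pred))
```

With scoring="r2", best_params_ is chosen by the same criterion as the final comparison, so a searched model that beats the default in CV should also tend to beat it on the held-out test set.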
Does anyone have a suggestion regarding this hyperparameter-tuning method?
Cheers