I found caret to be slow, and it cannot tune all the parameters of xgboost models without building a custom model, which is considerably more complicated than using one's own custom function for evaluation.
However, if you are doing some up/down-sampling or SMOTE/ROSE, caret is the way to go, since it incorporates them correctly into the model-evaluation phase (during resampling). See: https://topepo.github.io/caret/subsampling-for-class-imbalances.html
That said, I found these techniques to have a very small impact on the results, and usually for the worse, at least in the models I trained.
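As a sketch of how caret wires subsampling into resampling (the data frame `df` and the two-level factor outcome `Class` below are hypothetical placeholders for your own data):

```r
library(caret)

# Subsampling is specified in trainControl, so it is applied inside each
# resampling iteration rather than once to the whole training set.
ctrl <- trainControl(method = "cv", number = 5,
                     sampling = "down",   # also "up", "smote" or "rose"
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

fit <- train(Class ~ ., data = df, method = "glm",
             metric = "ROC", trControl = ctrl)
```

Applying the sampling outside of `trainControl` (i.e. before resampling) leaks information and gives optimistic resampling estimates, which is exactly what caret avoids here.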
`scale_pos_weight` gives a higher weight to a certain class; if the minority class is at 10% abundance, then playing with `scale_pos_weight` values around 5 - 10 should be beneficial.
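A common starting point (just a rule of thumb, not a tuned value) is the ratio of negative to positive cases; `y` below is an assumed 0/1 label vector:

```r
# Ratio of negatives to positives; at 10% minority abundance this is about 9,
# which falls in the 5 - 10 range suggested above.
spw <- sum(y == 0) / sum(y == 1)

params <- list(objective = "binary:logistic",
               scale_pos_weight = spw)
```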
Tuning the regularization parameters can be quite beneficial for xgboost. Here one has several parameters: `alpha`, `lambda` and `gamma` - I found valid values to be 0 - 3. Other useful parameters that add regularization (by adding randomness) are `subsample`, `colsample_bytree` and `colsample_bylevel`. I found that playing with `colsample_bylevel` can also have a positive outcome on the model; `subsample` and `colsample_bytree` you are already utilizing.
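A minimal sketch of passing these regularization parameters through the params list (the values shown are arbitrary examples, not recommendations, and `dtrain` is an assumed `xgb.DMatrix`):

```r
library(xgboost)

params <- list(objective = "binary:logistic",
               alpha = 1,                # L1 regularization on weights
               lambda = 2,               # L2 regularization on weights
               gamma = 1,                # minimum loss reduction to make a split
               subsample = 0.75,         # row subsampling per tree
               colsample_bytree = 0.8,   # column subsampling per tree
               colsample_bylevel = 0.8)  # column subsampling per tree level

bst <- xgb.train(params = params, data = dtrain, nrounds = 100, verbose = 0)
```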
I would test a much smaller `eta` with more trees to see if the model benefits. `early_stopping_rounds` can speed up the process in that case.
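For instance (a sketch only; `dtrain` and `dvalid` are assumed `xgb.DMatrix` objects, and the specific values of `eta` and `nrounds` are illustrative):

```r
library(xgboost)

# Small learning rate, generous nrounds, and early stopping on a
# held-out set so training halts once the eval metric stops improving.
bst <- xgb.train(params = list(objective = "binary:logistic",
                               eta = 0.01,
                               eval_metric = "auc"),
                 data = dtrain,
                 nrounds = 5000,
                 watchlist = list(eval = dvalid),
                 early_stopping_rounds = 200,
                 verbose = 0)
```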
Other `eval_metric` options are probably going to be more beneficial than accuracy. Try `logloss` or `auc`, or even `map` and `ndcg`.
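Several metrics can be tracked at once in cross-validation, which makes comparing them cheap (again a sketch; `dtrain` is an assumed `xgb.DMatrix`):

```r
library(xgboost)

# xgb.cv accepts multiple metrics; all of them appear in evaluation_log.
cv <- xgb.cv(params = list(objective = "binary:logistic", eta = 0.05),
             data = dtrain,
             nrounds = 200,
             nfold = 5,
             metrics = list("logloss", "auc"),
             verbose = 0)
head(cv$evaluation_log)
```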
Here is a function for a grid search over hyper-parameters. It uses `auc` as the evaluation metric, but one can change that easily:
xgb.par.opt <- function(train, seed) {
  require(xgboost)
  ntrees <- 2000
  # train is expected to be an xgb.DMatrix; the labels are read from it so
  # scale_pos_weight can be computed from the class ratio
  labels <- getinfo(train, "label")
  searchGridSubCol <- expand.grid(subsample = c(0.5, 0.75, 1),
                                  colsample_bytree = c(0.6, 0.8, 1),
                                  gamma = c(0, 1, 2),
                                  eta = c(0.01, 0.03),
                                  max_depth = c(4, 6, 8, 10))
  aucErrorsHyperparameters <- apply(searchGridSubCol, 1, function(parameterList) {
    # extract the parameters to test
    currentSubsampleRate <- parameterList[["subsample"]]
    currentColsampleRate <- parameterList[["colsample_bytree"]]
    currentGamma <- parameterList[["gamma"]]
    currentEta <- parameterList[["eta"]]
    currentMaxDepth <- parameterList[["max_depth"]]
    set.seed(seed)
    xgboostModelCV <- xgb.cv(data = train,
                             nrounds = ntrees,
                             nfold = 5,
                             objective = "binary:logistic",
                             eval_metric = "auc",
                             verbose = 1,
                             print_every_n = 50,
                             early_stopping_rounds = 200,
                             stratified = TRUE,
                             scale_pos_weight = sum(labels == 0) / sum(labels == 1),
                             max_depth = currentMaxDepth,
                             eta = currentEta,
                             gamma = currentGamma,
                             colsample_bytree = currentColsampleRate,
                             min_child_weight = 1,
                             subsample = currentSubsampleRate)
    # keep iter, test AUC mean and test AUC std at the best iteration
    xvalidationScores <- as.data.frame(xgboostModelCV$evaluation_log)
    auc <- xvalidationScores[xvalidationScores$iter == xgboostModelCV$best_iteration, c(1, 4, 5)]
    auc <- cbind(auc, currentSubsampleRate, currentColsampleRate, currentGamma, currentEta, currentMaxDepth)
    names(auc) <- c("iter", "test.auc.mean", "test.auc.std", "subsample", "colsample", "gamma", "eta", "max.depth")
    print(auc)
    return(auc)
  })
  return(aucErrorsHyperparameters)
}
One can add other parameters to the `expand.grid` call.
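For example, the grid could be extended with `min_child_weight` and the regularization terms (the extra values here are arbitrary illustrations; the function body must also be extended to read and pass each new column to `xgb.cv`):

```r
searchGridSubCol <- expand.grid(subsample = c(0.5, 0.75, 1),
                                colsample_bytree = c(0.6, 0.8, 1),
                                gamma = c(0, 1, 2),
                                eta = c(0.01, 0.03),
                                max_depth = c(4, 6, 8, 10),
                                min_child_weight = c(1, 5),
                                alpha = c(0, 1),
                                lambda = c(1, 2))
```

Beware that the grid grows multiplicatively: the extension above already yields 8x the original number of cross-validated fits.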
I usually tune hyper-parameters on one CV repetition and evaluate them on additional repetitions with other seeds, or on the validation set (but tuning on the validation set should be done with caution to avoid over-fitting).
`pred.cv = ifelse(bst$pred < 0.5, 0, 1)` and `table(pred.cv, y)`. – missuse