Using XGBoost's xgb.importance(), an importance matrix can be printed showing variable importance values for the classification as measured by Gain, Cover, and Frequency. Gain is the recommended indicator of variable importance. Using caret resampling (repeatedcv, number = 10, repeats = 5), a particular tuning grid, and train with method = "xgbTree", the caret varImp() function shows the k-fold feature importance estimation scaled from 0-100%.
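A minimal sketch of that setup, assuming an illustrative tuning grid and a dataset (MASS::Pima.tr) chosen only for the example; xgbMod is the model name assumed in the question below:

library(caret)

# 10-fold cross-validation, repeated 5 times
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

# a particular tuning grid (these values are only placeholders)
grid <- expand.grid(nrounds = c(50, 100), max_depth = c(2, 3), eta = 0.1,
                    gamma = 0, colsample_bytree = 0.8,
                    min_child_weight = 1, subsample = 0.8)

xgbMod <- train(type ~ ., data = MASS::Pima.tr, method = "xgbTree",
                trControl = ctrl, tuneGrid = grid)

varImp(xgbMod)  # feature importance scaled from 0-100%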

My question is: does the caret varImp(xgbMod) wrapper function use Gain alone, or some combination of Gain, Cover, and Frequency?


1 Answer


One small clarification:

the caret varImp() function shows the k-fold feature importance estimation scaled from 0-100%.

caret estimates feature importance from the final fitted model, not from the cross-validations. The cross-validations only tell you the best hyperparameters (e.g. gamma) to fit the final model with; you can confirm this by inspecting the fitted object, as shown after the example below.

It is Gain. There is not much documentation on this, but I checked using an example:

library(caret)

data <- MASS::Pima.tr   # predict diabetes status ('type') from clinical measurements
set.seed(111)
mdl <- train(type ~ ., data = data, method = "xgbTree", tuneLength = 3,
             trControl = trainControl(method = "cv"))
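To see that division of roles, you can inspect the fitted object with caret's standard accessors:

mdl$bestTune           # hyperparameters selected by cross-validation
class(mdl$finalModel)  # "xgb.Booster" - the single final fit that varImp() reads from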

Set scale = FALSE to see the raw values:

varImp(mdl, scale = FALSE)
xgbTree variable importance

      Overall
glu   0.37953
age   0.19184
ped   0.16418
bmi   0.13755
npreg 0.06450
skin  0.04526
bp    0.01713

Compare with xgb.importance:

xgboost::xgb.importance(mdl$finalModel$feature_names, model = mdl$finalModel)
   Feature       Gain      Cover Frequency
1:     glu 0.37953480 0.17966683      0.16
2:     age 0.19183994 0.17190387      0.17
3:     ped 0.16417775 0.26768973      0.28
4:     bmi 0.13755463 0.09755036      0.09
5:   npreg 0.06450183 0.10811269      0.11
6:    skin 0.04526090 0.11229235      0.12
7:      bp 0.01713014 0.06278416      0.07
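
The raw varImp() values are exactly the Gain column. A quick programmatic check (a sketch, reusing the mdl object fitted above):

imp <- xgboost::xgb.importance(mdl$finalModel$feature_names, model = mdl$finalModel)
vi  <- varImp(mdl, scale = FALSE)$importance
all.equal(imp$Gain, vi[imp$Feature, "Overall"])  # TRUE: raw importance == Gain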