
So, I'm using the superconductivity dataset found here... It contains 82 variables, and I am subsetting the data to the first 2000 rows. But when I use xgboost with mlr3, it does not calculate the importance for all of the variables!

Here's how I'm setting everything up:

# Load mlr3 and the learner collection that provides regr.xgboost
library(mlr3)
library(mlr3learners)

# Read in data and keep the first 2000 rows
mydata <- read.csv("/Users/.../train.csv", sep = ",")
data <- mydata[1:2000, ]

# Set up the regression task and the xgboost learner
myTaskXG = TaskRegr$new(id = "data", backend = data, target = "critical_temp")
myLrnXG = lrn("regr.xgboost")
myModXG <- myLrnXG$train(myTaskXG)

# Take a look at the importance
myLrnXG$importance()

This outputs something like this:

     wtd_mean_FusionHeat      std_ThermalConductivity              entropy_Density 
             0.685125173                  0.105919410                  0.078925149 
    wtd_gmean_FusionHeat      wtd_range_atomic_radius           entropy_FusionHeat 
             0.038797205                  0.038461823                  0.020889094 
        wtd_mean_Density           wtd_std_FusionHeat    gmean_ThermalConductivity 
             0.017211730                  0.006662321                  0.005598844 
    wtd_entropy_ElectronAffinity          wtd_entropy_Density
                     0.001292733                  0.001116518

As you can see, there are only 11 variables there, when there should be 81. If I do a similar process using ranger, everything works perfectly.

Any suggestions as to what is happening?


1 Answer


Short answer: {xgboost} does not return an importance value for every variable.

Longer answer:

This is not an mlr3 question but one about the xgboost package. The importance() method of this learner simply calls xgboost::xgb.importance(). If you look at the example on its help page:

# Load xgboost and fit the small model from the help-page example
library(xgboost)

data(agaricus.train, package = 'xgboost')
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max_depth = 2,
               eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
xgb.importance(model = bst)

This returns

> xgb.importance(model = bst)
                   Feature       Gain     Cover Frequency
1:               odor=none 0.67615471 0.4978746       0.4
2:         stalk-root=club 0.17135375 0.1920543       0.2
3:       stalk-root=rooted 0.12317236 0.1638750       0.2
4: spore-print-color=green 0.02931918 0.1461960       0.2

But there are 127 variables in the full dataset.
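
You can check both counts yourself; this is just a quick sanity check reusing the bst model fitted above:

# Compare the number of features in the importance table with the data
imp <- xgb.importance(model = bst)
nrow(imp)                  # number of features the boosted trees actually use
ncol(agaricus.train$data)  # total number of columns in the training matrix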

The reason is simply that ranger and xgboost compute importance differently: xgboost's gain-based importance only covers the features that are actually used as split variables in the fitted trees, whereas ranger's impurity or permutation importance assigns a value to every feature in the data.
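
If you want a vector with an entry for every feature of the task (with an explicit 0 for the features xgboost never used), one option is to pad the importance yourself. A minimal sketch, reusing the myTaskXG and myLrnXG objects from your question:

# Start from a zero vector over all task features, then fill in xgboost's values
imp <- myLrnXG$importance()
full_imp <- setNames(numeric(length(myTaskXG$feature_names)),
                     myTaskXG$feature_names)
full_imp[names(imp)] <- imp
sort(full_imp, decreasing = TRUE)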

By the way, next time please provide a reprex (a short, reproducible example using easily accessible data and code).