0 votes

Good morning,

I have a question about calculating feature importance for bagged and boosted regression tree models with the mlr package in R. I am using xgboost to make predictions and bagging to estimate the prediction uncertainty. My data set is relatively large: approximately 10k features and observations. The predictions work perfectly (see code below), but I can't seem to calculate feature importance (the last line in the code below). The importance function crashes with no errors and freezes the R session.

I saw some related Python code where people seem to calculate the importance for each of the bagged models (here and here). I haven't been able to get that to work properly in R either. Specifically, I'm not sure how to access the individual models within the object produced by mlr (the mb object in the code below). In Python this seems to be trivial, but in R I can't seem to extract mb$learner.model, which seems logically closest to what I need. Has anyone had any experience with this issue?

Please see the code below:

library(mlr)

# regression task on the training data (train.all and weights1 are defined elsewhere)
learn1 <- makeRegrTask(data = train.all, target = "resp", weights = weights1)

lrn.xgb <- makeLearner("regr.xgboost", predict.type = "response")
lrn.xgb$par.vals <- list(objective = "reg:squarederror", eval_metric = "error",
                         nrounds = 300, gamma = 0, booster = "gbtree", max.depth = 6)

# bag 50 xgboost models, each fit on 85% of the observations and 100% of the features
lrn.xgb.bag <- makeBaggingWrapper(lrn.xgb, bw.iters = 50, bw.replace = TRUE,
                                  bw.size = 0.85, bw.feats = 1)
lrn.xgb.bag <- setPredictType(lrn.xgb.bag, predict.type = "se")
mb <- mlr::train(lrn.xgb.bag, learn1)

fimp1 <- getFeatureImportance(mb)  # this call freezes the R session
getFeatureImportance() takes a wrapped model, so mb should be fine here. Also see ?mlr::getLearnerModel(). Have a look at the vignettes as well. – pat-s
@pat-s Thank you. I'm getting some interesting errors with these options. mlr::getFeatureImportance(mb) gives me Error in xgboost::xgb.importance(feature_names = .model$features, model = mod : model: must be an object of class xgb.Booster. However, I can extract the individual models with mb1 <- getLearnerModel(mb, more.unwrap = T), then try to get the importance for a single model with mlr::getFeatureImportance(mb1[[1]]) and get Error: Assertion on 'object' failed: Must inherit from class 'WrappedModel', but has class 'xgb.Booster'. This looks to me like a class issue? – Yodi
@pat-s was partly wrong. getFeatureImportance() takes a WrappedModel, but the model created by a BaggingWrapper is a HomogeneousEnsembleModel, for which mlr does not offer a method of its own. So you have to aggregate the feature importance values manually. However, it won't work if each model is trained on just one feature (bw.feats = 1), as in your example. – jakob-r
@jakob-r Thanks. bw.feats = 1 refers to the "percentage size of randomly selected features in bags", so each model has many features. But thanks for the clarification. However, if I understand you correctly, I can manually create an ensemble of 50 different models (the same learner but different names) and then getFeatureImportance() should work on the ensemble? – Yodi
Sorry, you are right. bw.feats = 1 equals 100% of the features, which is a sensible choice. I posted an answer that should work. – jakob-r

1 Answer

0 votes

If you set bw.feats = 1, it might be feasible to average the feature importance values. Basically, you just have to apply over all the single models that are stored in the HomogeneousEnsembleModel. Some extra care is necessary because the order of the features gets mixed up by the sampling, even though we set it to 100%.

library(mlr)
# small toy data set: y depends mostly on x2, weakly on x3
data = data.frame(x1 = runif(100), x2 = runif(100), x3 = runif(100))
data$y = with(data, x1 + 2 * x2 + 0.1 * x3 + rnorm(100))
task = makeRegrTask(data = data, target = "y")
lrn.xgb = makeLearner("regr.xgboost", predict.type = "response")
lrn.xgb$par.vals = list(objective = "reg:squarederror", eval_metric = "error",
                        nrounds = 50, gamma = 0, booster = "gbtree", max.depth = 6)

lrn.xgb.bag = makeBaggingWrapper(lrn.xgb, bw.iters = 10, bw.replace = TRUE,  bw.size = 0.85, bw.feats = 1)
lrn.xgb.bag = setPredictType(lrn.xgb.bag, predict.type="se")
mb = mlr::train(lrn.xgb.bag, task)
# each element of next.model is one bagged model (a WrappedModel),
# so getFeatureImportance() works on it
fimps = lapply(mb$learner.model$next.model, function(x) getFeatureImportance(x)$res)
fimp = fimps[[1]]
# we have to take extra care because the results are not ordered
for (i in 2:length(fimps)) {
  fimp = merge(fimp, fimps[[i]], by = "variable")
}
rowMeans(fimp[,-1]) # only makes sense with bw.feats = 1
# [1] 0.2787052 0.4853880 0.2359068
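
Not part of the original answer, just a sketch under a couple of assumptions: getLearnerModel(mb) with the default more.unwrap = FALSE should return the same list of bagged WrappedModel objects as mb$learner.model$next.model, and the importance tables can be aligned by feature name instead of by repeated merging. This assumes getFeatureImportance()$res has the columns variable and importance (as the merge by "variable" above implies); base.models, imp.list, feat.names and imp.mat are illustrative names.

# sketch: average per-model importances, matched by feature name
base.models = getLearnerModel(mb, more.unwrap = FALSE)  # list of WrappedModel

imp.list = lapply(base.models, function(m) {
  res = getFeatureImportance(m)$res
  setNames(res$importance, res$variable)  # named vector: importance per feature
})

# line the values up by feature name; features a booster never split on
# may be missing from its table, so fill those with 0
feat.names = getTaskFeatureNames(task)
imp.mat = sapply(imp.list, function(v) {
  out = setNames(numeric(length(feat.names)), feat.names)
  out[names(v)] = v
  out
})

sort(rowMeans(imp.mat), decreasing = TRUE)  # averaged importance per feature

With bw.feats = 1 every model sees every feature, so this average is directly comparable to the rowMeans() result above; with bw.feats < 1 you would additionally have to decide how to treat features that a given bag never saw.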