
Recently I have been learning about nested resampling in the mlr3 package. According to the mlr3 book, the goal of nested resampling is to get an unbiased performance estimate for a learner. I ran a test as follows:

# loading packages
library(mlr3)
library(paradox)
library(mlr3tuning)

# setting tune_grid
tune_grid <- ParamSet$new(list(
  ParamInt$new("mtry", lower = 1, upper = 15),
  ParamInt$new("num.trees", lower = 50, upper = 200)
))

# setting AutoTuner
at <- AutoTuner$new(
  learner = lrn("classif.ranger", predict_type = "prob"),
  resampling = rsmp("cv", folds = 5),
  measure = msr("classif.auc"),
  search_space = tune_grid,
  tuner = tnr("grid_search", resolution = 3),
  terminator = trm("none"),
  store_tuning_instance = TRUE)

# nested resampling
set.seed(100)
resampling_outer <- rsmp("cv", folds = 3)   # outer resampling
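# task_train is the classification task created earlier (not shown here)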
rr <- resample(task_train, at, resampling_outer, store_models = TRUE)

> lapply(rr$learners, function(x) x$tuning_result)
[[1]]
   mtry num.trees learner_param_vals  x_domain classif.auc
1:    1       200          <list[2]> <list[2]>   0.7584991

[[2]]
   mtry num.trees learner_param_vals  x_domain classif.auc
1:    1       200          <list[2]> <list[2]>   0.7637077

[[3]]
   mtry num.trees learner_param_vals  x_domain classif.auc
1:    1       125          <list[2]> <list[2]>   0.7645588

> rr$aggregate(msr("classif.auc"))
classif.auc 
  0.7624477 

The result shows that the 3 hyperparameter configurations chosen by the 3 inner resamplings are not guaranteed to be the same. This is similar to this post (which gets 3 different cp values from the inner resampling): mlr3 resample autotuner - not showing tuned parameters?

My questions are:

  1. I used to think that the aggregated result rr$aggregate() is the mean of the 3 models' scores, but it is not: (0.7584991 + 0.7637077 + 0.7645588) / 3 = 0.7622552, not 0.7624477. Do I misunderstand the aggregate result?
  2. How should I interpret this aggregated performance of 3 models, each with a different best hyperparameter configuration from the inner resampling? It is an unbiased performance estimate of what?

Thanks!


1 Answer


The result shows that the 3 hyperparameter configurations chosen by the 3 inner resamplings are not guaranteed to be the same.

It sounds like you want to fit a final model with the hyperparameters selected in the inner resamplings. Nested resampling is not used to select hyperparameter values for a final model; the inner tuning results are only inspected to check that the tuning is stable, i.e. that the selected hyperparameters do not vary too much across the outer folds.
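For example, here is a minimal sketch that collects the inner tuning results of all outer folds into one table, so you can eyeball how much mtry and num.trees vary (it only assumes the rr object from your question is still in the session):

library(data.table)

# bind the inner tuning result of each outer fold into one table
inner_results <- rbindlist(
  lapply(rr$learners, function(x) x$tuning_result),
  fill = TRUE
)
inner_results[, .(mtry, num.trees, classif.auc)]

Depending on your mlr3tuning version, extract_inner_tuning_results(rr) should give you a similar table in a single call.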

  1. Yes, that is the misunderstanding: you are comparing the aggregated performance over the outer resampling test sets (rr$aggregate()) with the performances estimated on the inner resampling test sets (lapply(rr$learners, function(x) x$tuning_result)). The three AUC values in the tuning results come from the 5-fold inner CV used for tuning, so their mean does not have to match the outer aggregate; see the sketch after this list.

  2. The aggregated performance over all outer resampling iterations is the unbiased performance estimate of a ranger model whose hyperparameters are selected by grid search. You can run at$train(task) to get a final model (tuned once more on the full task) and report the performance estimated with nested resampling as the unbiased performance of this model, as sketched below.
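A minimal sketch of both points, reusing rr, at and task_train from your question (macro averaging over iterations is the default aggregation of mlr3 measures, which is why the mean over the outer folds is what rr$aggregate() reports):

# per-iteration performance on the three *outer* test sets
outer_scores <- rr$score(msr("classif.auc"))
outer_scores[, .(iteration, classif.auc)]

# rr$aggregate() is the mean of these outer scores,
# not the mean of the inner tuning results
mean(outer_scores$classif.auc)
rr$aggregate(msr("classif.auc"))

# final model: tune on the full task and fit with the selected
# hyperparameters; report the nested resampling estimate as its
# unbiased performance
at$train(task_train)
at$tuning_result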