
I run the code below. If I deactivate the instantiation (as shown), the results of the three benchmark experiments differ, and the conclusion about which learner performs better may change.

How can I address this issue? One option might be to average over a large number of resamplings. I could write code for this myself, but perhaps this is already available as an option when calling benchmark()?

library(mlr3)
library(mlr3pipelines)  # the graph learners (glrn_*) are built with mlr3pipelines

# task, glrn_knn_pca and glrn_knn_nopca are defined earlier (not shown)

resampling = rsmp("cv", folds = 20)
#resampling$instantiate(task)   # results below will (and should) differ if instantiation is not performed here

design = benchmark_grid(
  tasks = task,
  learners = list(glrn_knn_pca, glrn_knn_nopca),
  resamplings = resampling
)

design2 = benchmark_grid(
  tasks = task,
  learners = list(glrn_knn_pca, glrn_knn_nopca),
  resamplings = resampling
)


design3 = benchmark_grid(
  tasks = task,
  learners = list(glrn_knn_pca, glrn_knn_nopca),
  resamplings = resampling
)


bmr = benchmark(design)
bmr2 = benchmark(design2)
bmr3 = benchmark(design3)

# aggregated AUC per learner; the three runs can disagree on which learner wins
bmr$aggregate(msr("classif.auc"))
bmr2$aggregate(msr("classif.auc"))
bmr3$aggregate(msr("classif.auc"))
How large is the difference? If it's very large, you might need a different way of evaluating. For example, leave-one-out CV should always give you the same results, but it will be very expensive. – Lars Kotthoff

Have you tried setting a seed? – pat-s

@pat-s: Yes, I tried. My problem is that setting a seed would cover up the problem that the decision about which algorithm performs better depends strongly on the seed. So I need to prevent this from happening. – ds_col
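
As an aside to the first comment, leave-one-out CV is available as its own resampling in mlr3. A minimal sketch, assuming the task and one of the graph learners from the question's (unshown) setup; AUC is scored on the pooled predictions because a single test observation per iteration has no per-fold AUC:

library(mlr3)

# leave-one-out CV: one test observation per iteration, so the split is
# fully determined by the data and repeated runs give identical partitions
rr = resample(task, glrn_knn_pca, rsmp("loo"))

# score the combined predictions across all iterations
rr$prediction()$score(msr("classif.auc"))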

2 Answers

2 votes

It looks to me like you want to use repeated CV to minimize the variability introduced by the partitioning.

Instead of resampling = rsmp("cv", folds = 20), you could use resampling = rsmp("repeated_cv", folds = 20, repeats = 100) to create 100 different resampling scenarios and benchmark all your learners across them.

This is a common approach in ML to reduce the impact of a single partitioning.
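
A minimal sketch of what this could look like, assuming task, glrn_knn_pca and glrn_knn_nopca from the question:

library(mlr3)

# 20-fold CV repeated 100 times gives 2000 train/test splits per learner
resampling = rsmp("repeated_cv", folds = 20, repeats = 100)

design = benchmark_grid(
  tasks = task,
  learners = list(glrn_knn_pca, glrn_knn_nopca),
  resamplings = resampling
)

bmr = benchmark(design)

# AUC is now averaged over all repetitions, so a single partitioning
# has much less influence on the ranking of the two learners
bmr$aggregate(msr("classif.auc"))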

1 vote

If you want to find out which learner performs better, it is not sufficient to compare only the aggregated performance measures. Statistical tests and plots for benchmark comparisons are implemented in the mlr3benchmark package.
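
As a rough sketch of that workflow (the functions below, as_benchmark_aggr() and the Friedman-test methods, are assumed from mlr3benchmark, and the rank-based tests generally expect results on more than one task):

library(mlr3benchmark)
library(ggplot2)  # provides the autoplot generic

# collapse the BenchmarkResult into one aggregated score per task/learner pair
bma = as_benchmark_aggr(bmr, measures = msr("classif.auc"))

# global test: do the learners differ at all?
bma$friedman_test()

# pairwise post-hoc comparisons between learners
bma$friedman_posthoc()

# critical-difference style plot of the learner ranking
autoplot(bma, type = "cd")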