
In mlr there is a way to implement nested cross-validation. In nested CV, the inner loop is used to select the best tuning parameters and the outer loop is used to evaluate the model performance. When I combine nested CV with a feature-selection step, I'm a bit confused about what mlr will return as the best inner tuned model. For example, I want to first apply a filter that keeps features whose correlation p-value with the outcome is < 0.05. In nested CV (I think of it as training, validation and test sets), it should work like this: in the inner loop, for each training set, apply the filter, then tune the parameter of interest and evaluate on the validation set. From the inner loop we get the best tuning parameter and the feature set associated with it.

What I'm wondering is what the best inner tuned model passes on to the outer-loop training. I can think of two possibilities:

  1. The best inner tuned model returns only the best tuning parameter, not the selected feature subset. So in the outer loop, we first apply the same filter again, then train on the training+validation set with the best tuning parameter.

  2. The best inner tuned model returns both the best tuning parameter and the selected feature subset. So in the outer loop, we just train on the training+validation set with the best tuning parameter and the feature subset selected in the inner loop.

In my opinion, the first one makes more sense. Part of my code is below:

library(mlr)

# base SVM learner with probability predictions
svm_learner <- makeLearner("classif.svm", predict.type = "prob", fix.factors.prediction = TRUE)
# filter wrapper: the filter is applied inside each training set
svm_filter <- makeFilterWrapper(learner = svm_learner,
                                fw.method = "t.test.filter", fw.threshold = -0.05)
# tune wrapper: the inner resampling tunes the hyperparameters of the filtered learner
svm_filter_nested <- makeTuneWrapper(svm_filter, par.set = ps,
                                     control = ctrl, resampling = inner)
# outer resampling; models = TRUE keeps the fitted outer-fold models
r <- resample(svm_filter_nested, task, resampling = outer, models = TRUE)
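For context, the snippet above assumes that ps, ctrl, inner, outer and task are already defined. A minimal, purely illustrative sketch of how they might look (the data set name, parameter ranges and fold counts are assumptions, not part of the original code):

# hypothetical definitions of the objects the snippet above assumes
task  <- makeClassifTask(data = my_data, target = "outcome")  # my_data is a placeholder data frame
ps    <- makeParamSet(
  makeNumericParam("cost",  lower = -5, upper = 5, trafo = function(x) 2^x),
  makeNumericParam("gamma", lower = -5, upper = 5, trafo = function(x) 2^x)
)
ctrl  <- makeTuneControlRandom(maxit = 20L)
inner <- makeResampleDesc("CV", iters = 5)  # validation folds for tuning
outer <- makeResampleDesc("CV", iters = 5)  # test folds for performance estimation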

1 Answer


Option 2) is correct.

The hyperparameters are optimized for the chosen feature subset. It would not make sense to do so if you reran the filtering process in the outer loop.

Nothing more than train/predict happens in each outer fold, with the parameters coming from the inner optimization loop. No optimization takes place in the outer loop.
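If you want to check this yourself, one way (a sketch, assuming the resample call above with models = TRUE) is to inspect the hyperparameters and filtered features each outer fold actually used; the exact way the wrapped models are nested is an assumption here:

# best hyperparameters chosen by the inner loop, one row per outer fold
getNestedTuneResultsX(r)

# feature subset the filter kept inside each outer training set
# (drilling into the wrapped models; the nesting path is an assumption)
lapply(r$models, function(m) getFilteredFeatures(m$learner.model$next.model))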

PS: You might want to ask such general questions on https://stats.stackexchange.com/ rather than on Stack Overflow, since they relate to general (statistical) concepts rather than programming. People will vote to close such questions because they lack a relation to programming. (Note, though, that no one from the mlr team is watching stats.stackexchange questions.)