9 votes

I have a question regarding the rfe function from the caret library. On the caret homepage they describe the following RFE algorithm.

For this example I am using the rfe function with 3-fold cross-validation, and the train function with a linear SVM and 5-fold cross-validation.

library(kernlab)
library(caret)
data(iris)

# parameters for the train function, used for fitting the SVM
trControl <- trainControl(method = "cv", number = 5)

# parameters for the RFE function
rfeControl <- rfeControl(functions = caretFuncs, method = "cv",
                         number = 3, verbose = FALSE)

rf1 <- rfe(as.matrix(iris[, 1:4]), as.factor(iris[, 5]), sizes = c(2, 3),
           rfeControl = rfeControl, trControl = trControl, method = "svmLinear")
From the algorithm above I assumed that rfe would work with two nested cross-validations (see the sketch below):

  1. rfe would split the data (150 samples) into 3 folds, and
  2. the train function would be run on each training set (~100 samples) with 5-fold cross-validation to tune the model parameters, with subsequent RFE.
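
A minimal sketch of the split sizes I expected, using caret::createFolds only to mimic the arithmetic of the two nested CV loops (this is just an illustration, not what rfe calls internally):

# outer loop: 3-fold CV on all 150 samples -> ~100 training samples per fold
outer <- createFolds(iris[, 5], k = 3, returnTrain = TRUE)
sapply(outer, length)   # roughly 100 100 100

# inner loop: 5-fold CV on one outer training set -> ~80 training samples per fold
inner <- createFolds(iris[outer$Fold1, 5], k = 5, returnTrain = TRUE)
sapply(inner, length)   # roughly 80 80 80 80 80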

What confuses me is that when I take a look at the results of the rfe function:

> lapply(rf1$control$index, length)
$Fold1
[1] 100
$Fold2
[1] 101
$Fold3
[1] 99

> lapply(rf1$fit$control$index, length)
$Fold1
[1] 120
$Fold2
[1] 120
$Fold3
[1] 120
$Fold4
[1] 120
$Fold5
[1] 120

From that it appears that the training sets of the 5-fold CV contain 120 samples each, where I would expect 80, i.e. 4/5 of the ~100 outer training samples (100 * 4/5 = 80 rather than 150 * 4/5 = 120).

So it would be great if someone could clarify how rfe and train work together.

Cheers

> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: i386-apple-darwin9.8.0/i386 (32-bit)

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] pROC_1.5.4      e1071_1.6-1     class_7.3-5     caret_5.15-048 
 [5] foreach_1.4.0   cluster_1.14.3  plyr_1.7.1      reshape2_1.2.1 
 [9] lattice_0.20-10 kernlab_0.9-15 

loaded via a namespace (and not attached):
 [1] codetools_0.2-8 compiler_2.15.1 grid_2.15.1     iterators_1.0.6
 [5] stringr_0.6.1   tools_2.15.1   
Comments:

5-fold CV leaves out one fifth of the data set for each CV arm. Therefore, you train on 120 each time and the test set is the remaining 30 samples; 30 samples * 5 = 150 samples. – tcash21

Yes, but according to the description of the algorithm the 5-fold CV should be applied to the training data resulting from the 3-fold CV. Therefore the first training set is 150/3 * 2 = 100, and the second is 100/5 * 4 = 80. – Fabian_G

@Fabian_G did you ever figure this out? I'm running into the same issue, and was considering contacting topepo or filing a bug report. – Reilstein

1 Answer

1 vote

The problem here is that rf1$fit$control$index does not contain what we think it does.

To understand this, I had to look into the code. What happens is the following:

When you call rfe, the whole data set is passed to nominalRfeWorkflow.

In nominalRfeWorkflow, the train and test data, split according to rfeControl (in our example three times according to the 3-fold CV rule), are passed to rfeIter. We can find these splits in the result under rf1$control$index.

In rfeIter the ~100 training samples (in our example) are used to find the final variables (which are the output of that function). As I understand it, the ~50 test samples (in our example) are used to calculate the performance of the different variable sets, but they are only stored as external performance and are not used to select the final variables. To select those, the performance estimates of the 5-fold cross-validation are used instead. However, we cannot find the corresponding indices in the final result returned by rfe. If we really need them, we have to fetch them from fitObject$control$index in rfeIter, return them to nominalRfeWorkflow, then to rfe, and from there into the rfe-class object returned by rfe.
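
If you just want to see those inner fold sizes without patching caret itself, a less invasive sketch is to wrap caretFuncs$fit so that every inner train call logs its CV fold sizes. This is only a sketch: the name innerFoldSizes is my own, not part of caret, and it assumes rfe runs the fits sequentially.

innerFoldSizes <- list()   # my own log, not part of caret
myFuncs <- caretFuncs
myFuncs$fit <- function(x, y, first, last, ...) {
  mod <- caretFuncs$fit(x, y, first, last, ...)  # runs train() underneath
  # log the training-set size of each inner CV fold
  innerFoldSizes[[length(innerFoldSizes) + 1]] <<- sapply(mod$control$index, length)
  mod
}

rf2 <- rfe(as.matrix(iris[, 1:4]), as.factor(iris[, 5]), sizes = c(2, 3),
           rfeControl = rfeControl(functions = myFuncs, method = "cv", number = 3),
           trControl = trControl, method = "svmLinear")

innerFoldSizes[[1]]   # ~80 per fold, i.e. 4/5 of the ~100 outer training samples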

So what is stored in rf1$fit$control$index? When rfe has found the best variables, a final model is fit with those variables and the full reference data (150 samples). rf1$fit is created in rfe as follows:

fit <- rfeControl$functions$fit(x[, bestVar, drop = FALSE], y, first = FALSE, last = TRUE, ...)

This function again runs the train function and does a final cross-validation with the full reference data, the final feature set, and the trControl passed via the ellipsis (...). Since our trControl specifies 5-fold CV, it is correct that lapply(rf1$fit$control$index, length) reports 120 samples per fold: 150 * 4/5 = 120.
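
A quick sanity check on the object from the question confirms that this final fit sees the full data set:

# each of the 5 training folds of the final fit holds 4/5 of all samples
sapply(rf1$fit$control$index, length)          # 120 120 120 120 120
# together the training folds cover the whole data set
length(unique(unlist(rf1$fit$control$index)))  # 150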