2
votes

Right now, I'm trying to use Caret rfe function to perform the feature selection, because I'm in a situation with p>>n and most regression techniques that don't involve some sort of regularisation can't be used well. I already used a few techniques with regularisation (Lasso), but what I want to try now is reduce my number of feature so that I'm able to run, at least decently, any kind of regression algorithm on it.

control <- rfeControl(functions=rfFuncs, method="cv", number=5)
model <- rfe(trainX, trainY, rfeControl=control)
predict(model, testX)

Right now, if I do it like this, a feature selection algorithm using random forest will be run, and then the model with the best set of features, according to the 5-fold cross-validation, will be used for the prediction, right?

I'm curious about two things here: 1) Is there an easy way to take the set of feature, and train another function on it that the one used for the feature selection? For example, reducing the number of features from 500 to 20 or so that seem more important and then applying k-nearest neighborhood.

I'm imagining an easy way to do it that would look like that:

control <- rfeControl(functions=rfFuncs, method="cv", number=5)
model <- rfe(trainX, trainY, method = "knn", rfeControl=control)
predict(model, testX)

2) Is there a way to tune the parameters of the feature selection algorithm? I would like to have some control on the values of mtry. The same way you can pass a grid of value when you are using the train function from Caret. Is there a way to do such a thing with rfe?

1

1 Answers

4
votes

Here is a short example on how to perform rfe with an inbuilt model:

library(caret)
library(mlbench) #for the data
data(Sonar)

rctrl1 <- rfeControl(method = "cv",
                     number = 3,
                     returnResamp = "all",
                     functions = caretFuncs,
                     saveDetails = TRUE)

model <- rfe(Class ~ ., data = Sonar,
             sizes = c(1, 5, 10, 15),
             method = "knn",
             trControl = trainControl(method = "cv",
                                      classProbs = TRUE),
             tuneGrid = data.frame(k = 1:10),
             rfeControl = rctrl1)

model
#output
Recursive feature selection

Outer resampling method: Cross-Validated (3 fold) 

Resampling performance over subset size:

 Variables Accuracy  Kappa AccuracySD KappaSD Selected
         1   0.6006 0.1984    0.06783 0.14047         
         5   0.7113 0.4160    0.04034 0.08261         
        10   0.7357 0.4638    0.01989 0.03967         
        15   0.7741 0.5417    0.05981 0.12001        *
        60   0.7696 0.5318    0.06405 0.13031         

The top 5 variables (out of 15):
   V11, V12, V10, V49, V9

model$fit$results
#output
    k  Accuracy     Kappa AccuracySD   KappaSD
1   1 0.8082684 0.6121666 0.07402575 0.1483508
2   2 0.8089610 0.6141450 0.10222599 0.2051025
3   3 0.8173377 0.6315411 0.07004865 0.1401424
4   4 0.7842208 0.5651094 0.08956707 0.1761045
5   5 0.7941775 0.5845479 0.07367886 0.1482536
6   6 0.7841775 0.5640338 0.06729946 0.1361090
7   7 0.7932468 0.5821317 0.07545889 0.1536220
8   8 0.7687229 0.5333385 0.05164023 0.1051902
9   9 0.7982468 0.5918922 0.07461116 0.1526814
10 10 0.8030087 0.6024680 0.06117471 0.1229467

for more customization see:

https://topepo.github.io/caret/recursive-feature-elimination.html