I have a classification problem, and one of the predictors is a categorical variable X with four levels (A, B, C, D) that was transformed into three dummy variables (A, B, C). I am trying to use Recursive Feature Elimination (RFE) from the caret package for feature selection. How do I tell the RFE function to treat the dummy variables of X as a group, so that if, say, A is excluded, B and C are excluded too?

After fighting with this all day, I'm still getting nowhere. Feeding RFE through the formula interface doesn't work either. I think RFE automatically converts any factors to dummy variables.

Below is my example code:

library(caret)

# rfe settings: logistic-regression helper functions, scored with twoClassSummary
lrFuncs$summary <- twoClassSummary
trainctrl <- trainControl(classProbs = TRUE,
                          summaryFunction = twoClassSummary)

ctrl <- rfeControl(functions = lrFuncs, method = "cv", number = 3)

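# Data pre-processing to exclude near-zero-variance and highly correlated variables
x <- training[, c(1, 4:25, 27:39)]
x2 <- model.matrix(~ ., data = x)[, -1]
nzv <- nearZeroVar(x2, freqCut = 300/1)
# Guard against an empty index: x2[, -integer(0)] would drop every column
x3 <- if (length(nzv) > 0) x2[, -nzv] else x2
corr_mat <- cor(x3)
too_high <- findCorrelation(corr_mat, cutoff = .9)
x4 <- if (length(too_high) > 0) x3[, -too_high] else x3

# nzv indexes columns of x2, while too_high indexes columns of x3
excludes <- c(colnames(x2)[nzv], colnames(x3)[too_high])

# Exclude the variables identified
# (dummy column names, i.e. factor name plus level, will not match factor columns of x)
x_frame <- x[, !(names(x) %in% excludes), drop = FALSE]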

#Run rfe
set.seed(408)
#This does not work with the error below
glmProfile<-rfe(x_frame,y,sizes =subsets, rfeControl = ctrl,trControl =trainctrl,metric = "ROC")
Error in { : task 1 failed - "undefined columns selected"
In addition: Warning messages:
1: glm.fit: fitted probabilities numerically 0 or 1 occurred 
2: glm.fit: fitted probabilities numerically 0 or 1 occurred 
3: glm.fit: fitted probabilities numerically 0 or 1 occurred 

# It works if x_frame is converted to a matrix and then back to a data frame, but this way rfe may remove only some of the dummy variables (e.g. remove A but leave B and C)
glmProfile<-rfe(data.frame(model.matrix(~., data = x_frame)[,-1]),y,sizes =subsets, rfeControl = ctrl,trControl =trainctrl,metric = "ROC")

Here, x_frame contains categorical variables that have multiple levels.

Any help is highly appreciated!

Thanks @grubjesic for the edit. - ybeybe
Since there's no answer so far, I'll mention how I am approaching this for now. If the rfe function suggests excluding some of the levels of a categorical variable, I review the importance of the remaining levels and decide either to exclude all levels altogether or to leave all of them in the model. Basically, I run a few experiments. This is a bit manual, but I think it's a viable approach. - ybeybe
a) Did you mean to state that you have 4 levels converted to 3 dummy variables? Obtaining 4 dummy variables would be the common way. b) Does your classifier actually need dummy variables? Otherwise you could consider converting them back to one variable with multiple levels. c) Is there a reason you need to exclude either all or none of the dummy variables? I can't imagine why one would want to do this when using RFE. - geekoverdose
Thank you @geekoverdose for your comments. I was trying to use RFE to conduct variable/feature selection for logistic regression. I wanted to find the 'optimal' set of variables that performs best in cross-validation. From what I understood, RFE tries subsets of variables and measures their performance, so it serves my purpose. I tried feeding RFE one variable with multiple levels, and it resulted in an error (see the error message in the OP); after converting the variable to dummies using model.matrix(), it worked (see the last line of the OP). - ybeybe

1 Answer

First: yes, you are right that you cannot use categorical features with RFE (there's a reasonable explanation of this by Max here on CV). Interestingly, encoding all levels as dummy variables does cause an error, presumably because the full set of dummies is collinear with the intercept; it can be resolved by removing one dummy variable. Consequently, I too would preprocess your data by creating dummy variables from the categorical variable while leaving out one level.
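As a minimal sketch of leaving out one level, base R's model.matrix() with its default treatment contrasts already does this. The toy data frame and its column names (X, z) are made up for illustration and are not the asker's data.

```r
# Toy data: the names X and z are hypothetical, chosen only for illustration.
df <- data.frame(
  X = factor(c("A", "B", "C", "D", "A", "B")),
  z = c(1.2, 0.5, 3.1, 2.2, 0.9, 1.7)
)

# With the default treatment contrasts, the first level (A) becomes the
# reference level and gets no dummy column of its own.
mm_full <- model.matrix(~ ., data = df)

# Drop the intercept column; drop = FALSE keeps the result a matrix.
mm <- mm_full[, -1, drop = FALSE]

dummy_names <- colnames(mm)
print(dummy_names)
```

The resulting matrix has one column per non-reference level of X plus the continuous column, and could be wrapped in data.frame() before being handed to rfe, as in the last line of the question.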

But I would not try to keep either all or none of the dummy variables in the end. If RFE throws out some of them (but not all), then some levels simply hold more valuable information than others, which is reasonable. Imagine that, of the levels A, B, C, only A holds valuable information for your target variable. If A got its own dummy variable, RFE would likely keep A and discard B and C. If A was instead the level left out during dummy variable creation, RFE would likely keep both B and C, since together they encode whether an observation belongs to A.
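To review which levels of each original variable survive the selection, one option is to map the dummy columns back to their source variable through the "assign" attribute that model.matrix() attaches to its result. This is only a sketch: the toy data and the vector `kept` are hypothetical, and in practice `kept` might come from something like caret's predictors() applied to the rfe result.

```r
# Toy data: the names X and z are hypothetical.
df <- data.frame(
  X = factor(c("A", "B", "C", "D", "A", "B")),
  z = c(1.2, 0.5, 3.1, 2.2, 0.9, 1.7)
)
mm_full <- model.matrix(~ ., data = df)

# "assign" gives, for each column, the index of the term it came from
# (0 is the intercept). Read it before subsetting, which drops the attribute.
term_index <- attr(mm_full, "assign")

# With the formula ~ ., the terms are the columns of df in order.
source_var <- names(df)[term_index[-1]]
names(source_var) <- colnames(mm_full)[-1]

# Hypothetical selection result: pretend these columns were retained.
kept <- c("XB", "z")

# Count how many columns of each original variable were retained.
kept_flags <- split(names(source_var) %in% kept, source_var)
kept_count <- vapply(kept_flags, sum, integer(1))
print(kept_count)
```

A variable whose count lies strictly between zero and its number of dummy columns is one where only some levels were retained, which, as argued above, is not necessarily a problem.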

PS: when mixing continuous and categorical information, consider scaling your data before handing it to RFE, so that continuous and categorical predictors have a roughly similar impact on RFE.
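One simple way to do this, sketched here with base R's scale() on made-up data, is to center and scale every column of the dummy-encoded matrix. Whether the dummy columns should be scaled along with the continuous ones is a judgment call; this sketch assumes they are.

```r
# Toy data: the names X and z are hypothetical.
df <- data.frame(
  X = factor(c("A", "B", "C", "D", "A", "B")),
  z = c(1.2, 0.5, 3.1, 2.2, 0.9, 1.7)
)
mm <- model.matrix(~ ., data = df)[, -1, drop = FALSE]

# scale() centers each column to mean 0 and rescales it to standard deviation 1.
mm_scaled <- scale(mm)

col_means <- colMeans(mm_scaled)
col_sds <- apply(mm_scaled, 2, sd)
print(round(col_sds, 6))
```

The scaled matrix can then be converted with data.frame() and passed to rfe in place of the unscaled one.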