1
votes

To build a classification model, I am trying to select the most important features from my data set.
My data contains mixed attributes (numerical and categorical). I plan to fit a random forest and then apply the importance or varImp functions in R to select features and improve the accuracy of my model.

My question is: can I apply random forest directly to the data without a transformation step, or do I have to convert the categorical attributes into binary (0/1) columns first?

I have already applied random forest with the importance / varImp functions to a purely numeric data set and the model works fine, but I am not sure about mixed data.

2
As the variable importance measures the decrease in accuracy when the values of one column are permuted, you should be fine. Have you tried it so far? Did you have any problems? The question could be flagged as off-topic... - loki
I applied random forest to a numeric data set only. To improve the accuracy of the model, I used the importance function in R to choose the most important features, and the accuracy improved. Now I want to apply the same method to mixed data, and I need to know whether I can apply it directly (without converting the data from one type to another) or not. - Noor
I think yes, as the VI measures are not affected by the data type. - loki

2 Answers

1
votes

Yes, it is possible to include factor variables (even ordered factors) in a random forest in R, both for variable importance measures and for classification / regression.

See this reproducible example:

library(randomForest)

df <- iris
df$Petal.Width <- as.factor(df$Petal.Width)
str(df)
# 'data.frame': 150 obs. of  5 variables:
# $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : Factor w/ 22 levels "0.1","0.2","0.3",..: 2 2 2 2 2 4 3 2 2 1 ...
# $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

rfmodel <- randomForest(x = df[, 1:4],
                        y = df$Species,
                        importance = TRUE)
importance(rfmodel)
#                 setosa versicolor virginica MeanDecreaseAccuracy MeanDecreaseGini
# Sepal.Length 11.266441   8.036164 13.480521            15.940870        14.152530
# Sepal.Width   6.394913   4.071819  5.076422             7.869699         2.880664
# Petal.Length 43.532850  39.802356 46.246262            60.663778        53.622069
# Petal.Width  14.272307  24.389310 19.109018            26.923048        28.617028
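
To connect this back to the feature selection goal in the question, here is a minimal sketch of ranking the predictors by importance and refitting on the top ones. The seed, the cutoff of two predictors, and the names rf_full, rf_small and top_vars are assumptions made for illustration, not part of the original example.

```r
library(randomForest)

set.seed(42)  # arbitrary seed, only so the run is repeatable

# Mixed data: three numeric predictors and one factor predictor
df <- iris
df$Petal.Width <- as.factor(df$Petal.Width)

rf_full <- randomForest(x = df[, 1:4], y = df$Species, importance = TRUE)

# Rank predictors by permutation importance (type = 1 is MeanDecreaseAccuracy)
imp <- importance(rf_full, type = 1)
ranked <- rownames(imp)[order(imp[, 1], decreasing = TRUE)]

# Keep the top two predictors (arbitrary cutoff) and refit
top_vars <- ranked[1:2]
rf_small <- randomForest(x = df[, top_vars, drop = FALSE], y = df$Species)

# Compare the out-of-bag error rates of the two models
oob_full  <- rf_full$err.rate[rf_full$ntree, "OOB"]
oob_small <- rf_small$err.rate[rf_small$ntree, "OOB"]
print(c(full = oob_full, top2 = oob_small))
```

Whether the reduced model is actually better depends on the data, so it is worth comparing the two out-of-bag error rates rather than assuming that fewer features help.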
0
votes

If you use the randomForest function from the randomForest package, you don't have to convert independent categorical variables into separate columns for each value.

However, you need to ensure that the dependent (predicted) variable is either a factor (for classification) or numeric (for regression).
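
As a rough illustration of both points, the sketch below fits one model with a factor response and one with a numeric response, each using a categorical predictor stored as a factor. The Size column is a hypothetical predictor invented for this example, and the seed is arbitrary.

```r
library(randomForest)

set.seed(1)  # arbitrary seed, only so the run is repeatable

df <- iris
# Hypothetical categorical predictor, kept as a single factor column
df$Size <- factor(ifelse(df$Sepal.Length > 5.8, "large", "small"))

# Factor response: randomForest fits a classification forest
rf_class <- randomForest(Species ~ Sepal.Width + Petal.Length + Size, data = df)

# Numeric response: randomForest fits a regression forest
rf_reg <- randomForest(Petal.Width ~ Sepal.Width + Petal.Length + Size, data = df)

print(rf_class$type)
print(rf_reg$type)
```

The type element of the fitted object reports which mode was chosen, which is a quick way to check that the response had the intended class before reading the importance values.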