I've got a random forest currently built on 100 different variables. I want to select only the "most important" variables to build my random forest on, to try to improve performance, but I don't know where to start other than getting the importance from rf$importance.

My data just consists of numerical variables which have all been scaled.

Below is my RF code:

library(randomForest)
library(Hmisc)   # rcorr.cens lives here

rf.2 = randomForest(x~., data=train, importance=TRUE, ntree=1501)

#train
rf_prob_train = data.frame(predict(rf.2, newdata=train, type="prob"))
rf_prob_train <-data.frame(rf_prob_train$X0)
val_rf_train<-cbind(rf_prob_train,train$x)
names(val_rf_train)<-c("Probs","x")

##Run accuracy ratio
x<-data.frame(rcorr.cens(-val_rf_train$Probs, val_rf_train$x))
rf_train_AR<-x[2,1]
rf_train_AR

#test
rf_prob_test = data.frame(predict(rf.2, test, type="prob"))
rf_prob_test <-data.frame(rf_prob_test$X0)
val_rf_test<-cbind(rf_prob_test,test$x)
names(val_rf_test)<-c("Probs","x")

##Run accuracy ratio
x<-data.frame(rcorr.cens(-val_rf_test$Probs, val_rf_test$x))
rf_test_AR<-x[2,1]
rf_test_AR
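For reference, once the forest is fit, the ranked importance can be pulled out and subset directly. A minimal sketch on toy data (the simulated data frame and the cutoff of 5 variables are hypothetical stand-ins for the real train set):

```r
library(randomForest)

# Toy classification data standing in for the real train set (hypothetical)
set.seed(1)
train <- data.frame(x = factor(sample(0:1, 200, replace = TRUE)),
                    matrix(rnorm(200 * 10), nrow = 200))

rf.2 <- randomForest(x ~ ., data = train, importance = TRUE, ntree = 101)

# Rank predictors by MeanDecreaseGini and keep the top 5
imp <- importance(rf.2)
top.vars <- names(sort(imp[, "MeanDecreaseGini"], decreasing = TRUE))[1:5]
top.vars
```

`top.vars` can then be pasted into a new formula to refit the forest on only those predictors.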
Do you know, or have an idea, which variables might be multicollinear? I've found that reducing the number of multicollinear variables helps. Also, are you normalizing continuous variables? That has also delivered performance gains for me. But yes, calling them with $importance is basically how it's done. You can also look at %variance explained, but they say more or less the same thing. – SeldomSeenSlim

Thanks for this. To be honest, I don't know exactly which ones, but I can have an educated guess. Once I've called them with $importance, do you know how to do the next step of including only the more important ones? Currently I've just got a list of my variables and MeanDecreaseGini. – user2902494

You just have to decide for yourself which ones you want to keep and which you want to reject. When you look at MeanDecreaseGini, does it look asymptotic? You might just grab everything above about the inflection point and leave the rest. If you need help subsetting based on something like variance explained, comment back and I'll write it up as an answer. – SeldomSeenSlim

Just FYI, random forest is very good at avoiding problems with collinearity and at self-regularization. I strongly doubt you will see any performance benefit from removing variables. – vincentmajor

All the documentation I've read agrees with you, @vincentmajor; however, my personal experience working with random forest has been that I get a better per-variable %VarExplained when I reduce the number of multicollinear variables. This makes sense: if many variables describe more or less the same thing, they split the variance when both are included in the RF model. Depending on what you are using RF for, this may or may not be important. I find myself having to explain why I'm including variables in models far more often than I would like, so for me, fewer is better. – SeldomSeenSlim
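As a concrete way to act on the multicollinearity point above, one quick check is the pairwise correlation matrix of the numeric predictors. A sketch using the iris predictors (the 0.9 threshold is an arbitrary choice):

```r
# Spot highly correlated (multicollinear) numeric predictors before fitting
num.vars <- iris[, -5]          # drop the factor column, keep numerics
cor.mat  <- cor(num.vars)

# Flag predictor pairs with |r| above 0.9 (upper triangle avoids duplicates)
high <- which(abs(cor.mat) > 0.9 & upper.tri(cor.mat), arr.ind = TRUE)
data.frame(var1 = rownames(cor.mat)[high[, 1]],
           var2 = colnames(cor.mat)[high[, 2]])
```

From each flagged pair you could keep the member with the higher importance and drop the other.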

1 Answer


Busy day, so I couldn't get this to you sooner. This gives you the general idea using a generic data set.

library(randomForest)
library(datasets)

head(iris)
#To make our formula for RF easier to manipulate

var.predict<-paste(names(iris)[-5],collapse="+")
rf.form <- as.formula(paste(names(iris)[5], var.predict, sep = " ~ "))

print(rf.form)
#This is our current iteration of the formula we're using in RF

iris.rf<-randomForest(rf.form,data=iris,importance=TRUE,ntree=100)

varImpPlot(iris.rf)
#Examine our Variable importance plot

#Identify the variable with the lowest decrease in accuracy (least relevant variable)
imp <- data.frame(iris.rf$importance)
to.remove <- which(names(iris) == rownames(imp)[which.min(imp$MeanDecreaseAccuracy)])

#Rinse, wash hands, repeat

var.predict<-paste(names(iris)[-c(5,to.remove)],collapse="+")
rf.form <- as.formula(paste(names(iris)[5], var.predict, sep = " ~ "))

iris.rf<-randomForest(rf.form,data=iris,importance=TRUE,ntree=100)

varImpPlot(iris.rf)
#Examine our Variable importance plot

#Map the weakest variable of this smaller model back to its column in iris,
#so the accumulated indices stay valid as the model shrinks
imp <- data.frame(iris.rf$importance)
to.remove <- c(to.remove, which(names(iris) == rownames(imp)[which.min(imp$MeanDecreaseAccuracy)]))

#And so on...
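The rinse-and-repeat blocks above can be collapsed into a loop. A sketch (n.keep = 2 is an arbitrary stopping point) that drops the weakest variable each round until a target number of predictors remain:

```r
library(randomForest)

n.keep <- 2                      # how many predictors to keep (arbitrary)
predictors <- names(iris)[-5]    # start with all four numeric predictors

while (length(predictors) > n.keep) {
  rf.form <- as.formula(paste("Species",
                              paste(predictors, collapse = "+"),
                              sep = " ~ "))
  fit <- randomForest(rf.form, data = iris, importance = TRUE, ntree = 100)

  # Drop the predictor with the lowest MeanDecreaseAccuracy this round
  worst <- rownames(fit$importance)[which.min(fit$importance[, "MeanDecreaseAccuracy"])]
  predictors <- setdiff(predictors, worst)
}
predictors   # the surviving "most important" variables
```

Because importance is recomputed on each refit, this is a backward-elimination sketch rather than a one-shot ranking; with 100 variables you would likely stop at an inflection point in the importance plot rather than a fixed n.keep.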