
I'm dealing with a large dataset that involves more than 100 features (which are all relevant because they have already been filtered; the original dataset had over 500 features). I created a random forest model via the train() function from the caret package and using the "ranger" method.

Here's the question: how does one extract all of the variables by importance, as opposed to only the top 20 most important variables? The varImp() function yields only the top 20 variables by default.

Here's some sample code (minus the training set, which is very large):

library(caret)
rforest_model <- train(target_variable ~ .,
                       data = train_data_set,
                       method = "ranger",
                       importance = "impurity")

And here's the code for extracting variable importance:

varImp(rforest_model)
Note that importance() doesn't work in this case: importance(rforest_model) results in the following error message: Error in UseMethod("importance") : no applicable method for 'importance' applied to an object of class "c('train', 'train.formula')" - Flavio Abdenur

1 Answer


The varImp() function extracts importance for all variables (even those the model does not use); only its print method is limited to the top 20. Consider this example:

library(mlbench) #for data set
library(caret)
library(tidyverse)

set.seed(998)
data(Ionosphere)

rforest_model <- train(y = Ionosphere$Class,
                       x = Ionosphere[,1:34],
                       method = "ranger",
                       importance = "impurity")

nrow(varImp(rforest_model)$importance) #34 variables extracted
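To get the whole ranking as a table rather than a count, one option is to sort the $importance data frame directly, or to pass top to the print method of the varImp object. The following is a minimal sketch, not the only way to do it: the names rf_fit, imp and imp_sorted are made up for illustration, and the 3-fold CV in trainControl() is only an assumption to keep the refit quick.

```r
library(caret)
library(mlbench)  # for the Ionosphere data set

set.seed(998)
data(Ionosphere)

# Refit the same kind of model as above; 3-fold CV is used here only for speed
rf_fit <- train(x = Ionosphere[, 1:34],
                y = Ionosphere$Class,
                method = "ranger",
                importance = "impurity",
                trControl = trainControl(method = "cv", number = 3))

# Full importance table: one row per predictor, column "Overall"
imp <- varImp(rf_fit)$importance

# Sort from most to least important, keeping the row names
imp_sorted <- imp[order(imp$Overall, decreasing = TRUE), , drop = FALSE]
print(imp_sorted)

# Alternatively, ask the print method for more than its default 20 rows
print(varImp(rf_fit), top = nrow(imp))
```

plot(varImp(rf_fit), top = nrow(imp)) accepts the same top argument if a quick lattice plot is enough.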

Let's check them:

varImp(rforest_model)$importance %>% 
  as.data.frame() %>%
  rownames_to_column() %>%
  arrange(Overall) %>%
  mutate(rowname = forcats::fct_inorder(rowname )) %>%
  ggplot()+
    geom_col(aes(x = rowname, y = Overall))+
    coord_flip()+
    theme_bw()

[Figure: horizontal bar chart of Overall importance for each of the 34 variables, sorted by importance]

Note that V2 is a zero-variance feature in this data set, so it has 0 importance and is not used by the model at all.
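Regarding the importance() error mentioned in the comment on the question: the object returned by train() is a caret wrapper, and the fitted ranger model sits in its $finalModel element. A rough sketch of pulling the raw (unscaled) importances from there follows; rf_fit and raw_imp are illustrative names, and the 3-fold CV is again only an assumption to keep the fit short. varImp() rescales to 0-100 by default, so these raw values will generally differ from the Overall column unless varImp() is called with scale = FALSE.

```r
library(caret)
library(mlbench)  # for the Ionosphere data set

set.seed(998)
data(Ionosphere)

rf_fit <- train(x = Ionosphere[, 1:34],
                y = Ionosphere$Class,
                method = "ranger",
                importance = "impurity",
                trControl = trainControl(method = "cv", number = 3))

# The underlying ranger object is stored in $finalModel;
# ranger's own importance() method works on it
raw_imp <- ranger::importance(rf_fit$finalModel)

# Named numeric vector with one entry per predictor, sorted here for reading
print(sort(raw_imp, decreasing = TRUE))

# The same values are also stored directly on the ranger object
head(rf_fit$finalModel$variable.importance)
```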