Variable Importance for Caret Random Forest Regression

Question

I have trouble understanding exactly what the feature importance scores mean in caret for RF regression. There are many possible importance measures for RF, but the documentation gives no clear indication of which one caret uses.

Here is a toy example:


data(iris)


y_train = iris['Sepal.Length']
X_train = iris[2:4]

mdl_rf_inner <- caret::train(X_train, y_train$Sepal.Length, method = "rf",
                             preProcess = c("center", "scale"),
                             ntree = 1000, importance = TRUE)

feat_imp_2 <- caret::varImp(mdl_rf_inner, scale = FALSE)

Resulting in:

rf variable importance

             Overall
Petal.Length   48.51
Sepal.Width    23.67
Petal.Width    17.15

Please keep in mind that I am predicting sepal length, so despite using the iris data it is a regression problem. I read the docs and there is no clear indication of which variable importance is being calculated (Gini impurity decrease? MSE decrease? permutation importance? out-of-bag? etc.).

To further complicate things, the train function also has the importance = T argument, which doesn't really seem to serve a clear purpose when using varImp(). Is that correct?

I would greatly appreciate your insights on this.

Best wishes!

This doesn't appear to be a specific programming question that's appropriate for Stack Overflow. If you have general questions about interpreting the results from statistical models, then you should ask such questions over at Cross Validated instead. You are more likely to get better answers there. — MrFlick

StupidWolf · Accepted Answer · 2020-12-19T03:56:23

If you read the help manual for varImp (?varImp):

*Random Forest*: ‘varImp.randomForest’ and ‘varImp.RandomForest’
are wrappers around the importance functions from the
‘randomForest’ and ‘party’ packages, respectively.

What it is effectively doing is using the importance() function in randomForest on your final model to give you:

randomForest::importance(mdl_rf_inner$finalModel)
             %IncMSE IncNodePurity
Sepal.Width  26.96516      8.014371
Petal.Length 44.64568     64.381750
Petal.Width  18.27348     27.448665

Compare with:

caret::varImp(mdl_rf_inner, scale=FALSE)
rf variable importance

             Overall
Petal.Length   44.65
Sepal.Width    26.97
Petal.Width    18.27

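It is the %IncMSE column: the permutation-based increase in out-of-bag MSE for each variable, divided by its standard deviation. randomForest::importance applies that scaling by default, while the scale argument of caret::varImp only controls caret's own rescaling to 0-100. You can read more on the help page for randomForest::importance.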
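To see what the scaling by SD amounts to, here is a minimal sketch that uses randomForest directly rather than caret. The seed and ntree = 500 are arbitrary choices for illustration, not values from the question, and the exact numbers will differ from the output shown above.

```r
library(randomForest)

set.seed(42)  # arbitrary seed, only for reproducibility of this sketch
X <- iris[2:4]
y <- iris$Sepal.Length

rf <- randomForest(X, y, ntree = 500, importance = TRUE)

# type = 1 selects the permutation measure (%IncMSE for regression).
# scale = TRUE is the default: the mean increase is divided by its SD.
scaled <- importance(rf, type = 1)
raw    <- importance(rf, type = 1, scale = FALSE)

# The fitted object stores the SDs used for that scaling.
sds <- rf$importanceSD

# Check that scaled == raw / SD for every predictor.
same <- all.equal(as.numeric(scaled), as.numeric(raw) / as.numeric(sds))
print(cbind(raw = as.numeric(raw), scaled = as.numeric(scaled)))
print(same)
```

If you want the unscaled permutation importance, calling importance() on the final model with scale = FALSE is probably the simplest route.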

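So importance=TRUE does serve a purpose: %IncMSE is only computed when you specify it while fitting the randomForest model (train passes the argument through). Without it, randomForest computes only IncNodePurity.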
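The effect of importance=TRUE can be checked with a short sketch like the one below. The seed, ntree = 500, and the 5-fold trainControl are assumptions made to keep the example quick; they are not taken from the question.

```r
library(caret)
library(randomForest)

set.seed(42)  # arbitrary seed for this sketch
X_train <- iris[2:4]
y_train <- iris$Sepal.Length
ctrl <- trainControl(method = "cv", number = 5)

# Fit with importance = TRUE so the permutation measure is computed.
fit_imp <- train(X_train, y_train, method = "rf",
                 ntree = 500, importance = TRUE, trControl = ctrl)

# caret's unscaled varImp versus randomForest's %IncMSE (type = 1).
vi_caret <- varImp(fit_imp, scale = FALSE)$importance
vi_rf    <- importance(fit_imp$finalModel, type = 1)
match_ok <- all.equal(vi_caret[rownames(vi_rf), "Overall"],
                      as.numeric(vi_rf))
print(match_ok)

# Fit without importance = TRUE: only the node-purity measure is available.
fit_plain <- train(X_train, y_train, method = "rf",
                   ntree = 500, trControl = ctrl)
plain_cols <- colnames(importance(fit_plain$finalModel))
print(plain_cols)
```

The first comparison should agree up to floating-point tolerance. The second model's importance matrix has a single IncNodePurity column, so there is no %IncMSE for varImp to report.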
