Feature importance for machine learning models in the caret package

Question

I have a question regarding the feature importance function in the caret package.

I have a dataset with both numeric and factor features. I used the command below to get the feature importance of the model. For the factor variables, it gives me the importance of each level (sub-feature). However, I just want the importance of the feature itself, without a breakdown for each level of the factor.

gbmImp <- caret::varImp(xgb1, scale = TRUE)

kmacierzanka · Accepted Answer · 2020-06-18T06:58:26

I will create some example data as we don't have any from your question:

library(caret)

# example data
df <- data.frame("x" = rnorm(100),
                 "fac" = as.factor(sample(c(rep("A", 30), rep("B", 35), rep("C", 35)))),
                 "y" = as.numeric((rpois(100, 4))))
# model
model <- train(y ~ ., method = "glm", data = df)
# feature importance
varImp(model, scale = TRUE)

This returns the per-level feature importance that you said you do not want:

# glm variable importance
#
#      Overall
# facB  100.00
# facC   13.08
# x       0.00

You can convert the factor variables to numeric and do the same thing:

# make the factor variable numeric
trans_df <- transform(df, fac = as.numeric(fac))
# model
trans_model <- train(y ~ ., method = "glm", data = trans_df)
# feature importance
varImp(trans_model, scale = TRUE)

This returns a single importance value for each feature as a whole:

# glm variable importance
# 
#     Overall
# x       100
# fac       0

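However, as.numeric() replaces each factor level with its integer code, so the model now treats fac as a single ordered numeric predictor instead of a set of dummy variables. I do not know how much this changes the feature importance reported by varImp(trans_model, scale = TRUE), so treat the result with caution.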
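To make the effect of the conversion concrete, here is a minimal base-R sketch. The vector f is made up for illustration and is not part of the example data above.

```r
# Hypothetical factor, for illustration only
f <- factor(c("A", "B", "C", "A"))

# The levels are stored in sorted order
levels(f)      # "A" "B" "C"

# as.numeric() returns the integer code of each level, not a "value"
codes <- as.numeric(f)
codes          # 1 2 3 1
```

Because the codes follow the level order, a model fitted on them assumes that C is "twice as far" from A as B is, which may or may not make sense for your data.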

Also, check out this SO thread if you find that your specific factor/character variables are problematic when converting to numeric.
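Another option, if you want to keep the factor as a factor, is to add up the per-level importances for each original feature. The sketch below is only a heuristic and not something caret does for you: it assumes that summing the scaled values is meaningful for your model, and the helper match_predictor() and the vector predictors are hypothetical names introduced here. The numbers are the per-level values printed earlier in this answer.

```r
# Per-level importances, copied from the varImp(model) output shown earlier
imp <- data.frame(Overall = c(100.00, 13.08, 0.00),
                  row.names = c("facB", "facC", "x"))

# Names of the original predictors (assumed known from the training data)
predictors <- c("x", "fac")

# Hypothetical helper: assign a term such as "facB" to the predictor whose
# name it starts with; the longest matching name wins
match_predictor <- function(term, predictors) {
  hits <- predictors[startsWith(term, predictors)]
  if (length(hits) == 0) return(NA_character_)
  hits[which.max(nchar(hits))]
}

imp$feature <- vapply(rownames(imp), match_predictor, character(1),
                      predictors = predictors)

# Sum the per-level importances within each original feature
agg <- aggregate(Overall ~ feature, data = imp, FUN = sum)
agg <- agg[order(-agg$Overall), ]
agg
```

With a fitted model, imp would come from varImp(model, scale = TRUE)$importance instead of being typed in by hand. Prefix matching can misassign a term when one predictor name is itself the start of another predictor's dummy-variable name, so check the feature column before trusting the totals.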
