
My data has a lot of categorical features, which I encode using DictVectorizer.

For example:

    df['color'] = ['green', 'blue', 'white']
    df['size'] = ['small', 'big', 'medium']

I use the RandomForest algorithm. When I check feature_importances_, I get a different value for each encoded category: green = 2.45e-2, blue = 6.2e-3, and so on.
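Roughly, my setup looks like this (a minimal sketch; the toy rows and the target y are invented for illustration):

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.ensemble import RandomForestClassifier

    rows = [
        {'color': 'green', 'size': 'small'},
        {'color': 'blue', 'size': 'big'},
        {'color': 'white', 'size': 'medium'},
    ]
    y = [0, 1, 0]  # hypothetical target, just for illustration

    # DictVectorizer turns each category into its own binary column
    vec = DictVectorizer(sparse=False)
    X = vec.fit_transform(rows)
    print(vec.get_feature_names_out())
    # ['color=blue' 'color=green' 'color=white' 'size=big' 'size=medium' 'size=small']

    clf = RandomForestClassifier(random_state=0).fit(X, y)
    print(clf.feature_importances_)  # one value per encoded column, not per original feature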

Shouldn't all encoded values of a categorical feature have the same feature importance? That is, shouldn't all categories of color share one importance, and all values of size share another? Is there a way to explicitly define feature_importances_? Note: I understand


1 Answer


When you binarize your categorical data, you transform a single feature into multiple features. If those binary features split the target variable differently, they will have different feature importances. So to answer your question: no, the binarized categorical features should not have the same feature importance.

Imagine your categories are "red", "blue", and "green", and your target variable is a binary "is ketchup" (0 or 1). A value of 1 for "green" indicates that the item isn't ketchup, but a value of 0 doesn't mean it is ketchup, since the item could still be "blue" (and hence not ketchup). The "red" feature therefore has a higher importance than "green" or "blue", because it splits the "is ketchup" target better.
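Here is a quick sketch of that example (the toy data is invented for illustration; OneHotEncoder plays the role of DictVectorizer here):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.preprocessing import OneHotEncoder

    colors = np.array(['red', 'red', 'green', 'blue', 'green', 'blue']).reshape(-1, 1)
    is_ketchup = np.array([1, 1, 0, 0, 0, 0])

    enc = OneHotEncoder(sparse_output=False)  # use sparse=False on scikit-learn < 1.2
    X = enc.fit_transform(colors)

    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, is_ketchup)
    for name, imp in zip(enc.get_feature_names_out(['color']), clf.feature_importances_):
        print(name, round(imp, 3))
    # 'color_red' tends to get the highest importance, since it alone
    # separates 'is ketchup' perfectly

The exact numbers vary from run to run, because each tree considers a random subset of features at each split, but the importances are generally not equal.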

Note that decision trees can, in principle, split on categorical values directly, but scikit-learn's trees only accept numeric input, so you still need to encode the data somehow; it just doesn't have to be one-hot binarization.
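For example, an ordinal encoding keeps a single column per categorical feature (a sketch; whether the implied ordering is acceptable depends on your problem):

    import pandas as pd
    from sklearn.preprocessing import OrdinalEncoder

    df = pd.DataFrame({'color': ['green', 'blue', 'white'],
                       'size': ['small', 'big', 'medium']})

    # each original column stays a single, integer-coded feature
    X = OrdinalEncoder().fit_transform(df[['color', 'size']])
    print(X)

With this encoding, feature_importances_ gives one value per original feature, at the cost of imposing an arbitrary order on the categories.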