3
votes

I found this thread from 2014 and the answer states that no, sklearn random forest classifier cannot handle categorical variables (or at least not directly). Has the answer changed in 2020?

I want to feed gender as a feature for my model. However, gender can take on three values: M, F or np.nan. If I encode this column into three dichotomous columns, how can the random forest classifier know that these three columns represent a single feature?

Imagine max_features = 7. When training a given tree, it will randomly pick seven features. Suppose gender was chosen. If gender is split into three columns (gender_M, gender_F, gender_NA), will the random forest classifier always pick all three columns and count them as one feature, or is there a chance that it will only pick one or two?

1
Any model can handle categorical data if it is encoded properly (e.g. one-hot encoding) - Divyanshu Srivastava
Yeah, but one hot encoding turns one column into multiple columns... - Arturo Sbr
Yes. And I don't see any harm in that. - Divyanshu Srivastava
If only one of the columns is selected when training a tree, the tree will only make splits based on one category from the entire range of categories. - Arturo Sbr
@DivyanshuSrivastava inflating the number of features is indeed an issue; I suggest you think about it more closely - desertnaut

1 Answer

1
votes

If max_features is set to a value lower than the actual number of columns (which is the advisable approach; see the recommended values for max_features in the docs), then yes, there is a chance that only a subset of the dummy columns is considered. Note that scikit-learn draws the max_features candidates at each split rather than once per tree, so any given split may see only one or two of the gender columns.
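To put a rough number on that chance, here is a minimal sketch that computes how many of the three gender columns end up among the candidates of a single draw. It assumes the candidates are drawn uniformly without replacement, and the figure of 12 total columns (9 other columns plus the 3 dummies) is a hypothetical chosen only for illustration; it is not taken from the question and it does not call scikit-learn itself.

```python
from math import comb


def prob_k_dummies_drawn(n_columns, n_dummies, max_features, k):
    """Probability that exactly k of the n_dummies dummy columns are among
    max_features columns drawn uniformly without replacement from n_columns."""
    return (
        comb(n_dummies, k)
        * comb(n_columns - n_dummies, max_features - k)
        / comb(n_columns, max_features)
    )


# Hypothetical setup: 9 other columns plus gender_M, gender_F, gender_NA.
N_COLUMNS = 12
N_DUMMIES = 3
MAX_FEATURES = 7  # value used in the question

probs = [
    prob_k_dummies_drawn(N_COLUMNS, N_DUMMIES, MAX_FEATURES, k)
    for k in range(N_DUMMIES + 1)
]

for k, p in enumerate(probs):
    print(f"{k} gender columns among the candidates: {p:.3f}")
```

Under these assumed numbers, all three gender columns are candidates together in only about 16% of draws, so seeing just one or two of them is the common case rather than the exception.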

But that is not necessarily too bad. In decision trees, the feature used at a node is chosen to optimize some metric, and each candidate feature is evaluated independently of the others, that is, considering only that feature and the target. So in a sense the model never treats these dummy columns as belonging to the same feature anyway.

In general though, the best approach for binary features is to come up with an appropriate method to fill the missing values, and then convert the feature into a single column encoded as 0s and 1s.
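As a sketch of that last suggestion, the snippet below fills the missing values and then encodes gender as a single 0/1 column with pandas. Filling with the most frequent value is only one possible choice and is an assumption here, as are the toy data and the column name gender_is_F; whether this imputation is appropriate depends on your data.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration only.
df = pd.DataFrame({"gender": ["M", "F", np.nan, "F", "F", "M"]})

# Assumption: impute missing values with the most frequent category.
most_common = df["gender"].mode(dropna=True)[0]
filled = df["gender"].fillna(most_common)

# Single binary column: 1 for F, 0 for M.
df["gender_is_F"] = (filled == "F").astype(int)

print(df)
```

The resulting gender_is_F column can be passed to RandomForestClassifier in place of the three dummy columns, so gender counts as exactly one feature when max_features candidates are drawn.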