
I am new to scikit-learn. I am trying to use it to train a decision tree classifier. My data consists of some categorical features and some continuous features, but when I train the classifier, the categorical features, which take values like 1, 2, 3 and so on, are treated as continuous. The resulting tree therefore splits on ranges even for the categorical features: for example, I get a node where X[0] < 4.5 implies a particular class, where X[0] is a categorical feature. Since X[0] is categorical, value 1 has nothing to do with value 2, yet the classifier is grouping them together by their numeric order. How do I deal with this?
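A minimal sketch of the behavior described above (the feature values and class labels here are hypothetical, just to make the effect visible):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# X[:, 0] is a categorical feature encoded as the integers 1..4;
# the classifier has no way of knowing that, so it treats it as ordered.
X = np.array([[1], [2], [3], [4], [1], [2], [3], [4]])
y = np.array([0, 0, 1, 1, 0, 0, 1, 1])

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# The printed tree contains threshold splits such as "feature_0 <= 2.5",
# i.e. categories are grouped purely by their numeric order.
print(export_text(clf))
```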

Also, is there any way to increase the number of splits at a node, so that a categorical feature with more than 2 categories can be split into more than two branches?

Comment: encode your variables as described here – lanenok

1 Answer


You should encode the categorical integer features first, and only then apply the DecisionTreeClassifier.

Try using OneHotEncoder from sklearn.preprocessing to preprocess your categorical features.

For instance, you could start with the following:

    from sklearn.preprocessing import OneHotEncoder
    ohe = OneHotEncoder(sparse_output=False)  # use sparse=False on scikit-learn < 1.2
    processed_X = ohe.fit_transform(X[['0']].values)  # select the categorical column as a 2D array

where '0' is the name of your categorical column (the feature that appears as X[0] in your tree).