
Here is my code:

from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import preprocessing
import os
import subprocess

def categorical_split():
    colors = ['blue', 'green', 'yellow', 'green', 'red']
    sizes = ['small', 'large', 'medium', 'large', 'small']

    # Encode the target labels; no reshape is needed for y
    size_encoder = preprocessing.LabelEncoder()
    sizes = size_encoder.fit_transform(sizes)

    # Encode the single feature and reshape it into a column vector for X
    color_encoder = preprocessing.LabelEncoder()
    colors = color_encoder.fit_transform(colors).reshape(-1, 1)

    dt = DecisionTreeClassifier(random_state=99)
    dt.fit(colors, sizes)

    with open("dt.dot", 'w') as f:
        export_graphviz(dt, out_file=f,
                        feature_names=['color'])

    command = ["dot", "-Tpng", "dt.dot", "-o", "dt.png"]
    subprocess.check_call(command)

categorical_split()

It generates the following decision tree: [image of the exported tree, dt.png]

Since decision trees in scikit-learn cannot handle categorical variables directly, I had to use LabelEncoder. On the graph we see splits like color <= 1.5. This kind of split shows that the categorical variable is treated as an ordinal one, and the split preserves the encoded order. If my data has no inherent order, this approach is detrimental. Is there a way around it? If you are planning to suggest one-hot encoding, could you please provide an example (code) of how it would help?
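
To make the problem concrete, here is a small sketch of what LabelEncoder actually does with these colors (it numbers the classes in sorted order):

from sklearn import preprocessing

colors = ['blue', 'green', 'yellow', 'green', 'red']
encoder = preprocessing.LabelEncoder()
encoded = encoder.fit_transform(colors)

# Classes are sorted alphabetically before being numbered:
# blue=0, green=1, red=2, yellow=3
print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))
print(encoded)  # [0 1 3 1 2]

# So a split like color <= 1.5 groups {blue, green} against
# {red, yellow} -- an ordering the colors never actually had.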

1 Answer


This is actually a perfectly valid approach and shouldn't be detrimental to your model's performance. It does make the tree a little harder to read, though. One nice alternative is pd.get_dummies, since it takes care of the column names for you:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import subprocess

colors = ['blue', 'green', 'yellow', 'green', 'red']
sizes = ['small', 'large', 'medium', 'large', 'small']

# One-hot encode the colors; get_dummies names the columns for you
df = pd.DataFrame({'color': colors})
df_encoded = pd.get_dummies(df)

dt = DecisionTreeClassifier(random_state=99)
dt.fit(df_encoded, sizes)

with open("dt.dot", 'w') as f:
    export_graphviz(dt, out_file=f,
                    feature_names=list(df_encoded.columns))

command = ["dot", "-Tpng", "dt.dot", "-o", "dt.png"]
subprocess.check_call(command)

[image of the resulting decision tree with one-hot features]
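
For reference, here is a quick sketch (on the same toy data) of the columns get_dummies creates; recent pandas versions emit boolean columns, older ones 0/1 integers:

import pandas as pd

colors = ['blue', 'green', 'yellow', 'green', 'red']
df_encoded = pd.get_dummies(pd.DataFrame({'color': colors}))

print(list(df_encoded.columns))
# ['color_blue', 'color_green', 'color_red', 'color_yellow']

With these columns, every split in the exported tree is a test like color_green <= 0.5, i.e. a pure membership check on a single color, so no artificial ordering is imposed.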