Here is my code:
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import preprocessing
import subprocess

def categorical_split():
    colors = ['blue', 'green', 'yellow', 'green', 'red']
    sizes = ['small', 'large', 'medium', 'large', 'small']

    # Encode the target labels; y should stay 1-D for fit()
    size_encoder = preprocessing.LabelEncoder()
    sizes = size_encoder.fit_transform(sizes)

    # Encode the single categorical feature and reshape to a column vector
    color_encoder = preprocessing.LabelEncoder()
    colors = color_encoder.fit_transform(colors).reshape(-1, 1)

    dt = DecisionTreeClassifier(random_state=99)
    dt.fit(colors, sizes)

    # Export the fitted tree and render it to PNG with Graphviz
    with open("dt.dot", 'w') as f:
        export_graphviz(dt, out_file=f, feature_names=['color'])
    subprocess.check_call(["dot", "-Tpng", "dt.dot", "-o", "dt.png"])

categorical_split()
It generates the following decision tree:
Since decision trees in scikit-learn cannot handle categorical variables directly, I had to use LabelEncoder. On the graph I see splits like color <= 1.5. A split like this treats the categorical variable as ordinal: it partitions the categories according to the order of their integer codes. If my data has no natural order, this approach is detrimental. Is there a way around it? If you are going to suggest one-hot encoding, could you please provide an example (code) of how it would help?
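To make the concern concrete, here is a minimal sketch (my own illustration, separate from the script above) of why the split boundary is arbitrary: LabelEncoder assigns integer codes in alphabetical order, so a threshold split groups categories by that arbitrary order.

from sklearn import preprocessing

colors = ['blue', 'green', 'yellow', 'green', 'red']
encoder = preprocessing.LabelEncoder()
codes = encoder.fit_transform(colors)

# Alphabetical assignment: blue -> 0, green -> 1, red -> 2, yellow -> 3
print(dict(zip(encoder.classes_, encoder.transform(encoder.classes_))))

# A tree split "color <= 1.5" therefore separates {blue, green} from
# {red, yellow} -- a grouping dictated by alphabetical order, not by
# anything meaningful in the data.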