I am new to ML in Python and confused about how to implement a decision tree with categorical variables, since in R they are handled automatically by party and ctree.
I want to make a decision tree with two categorical independent features and one dependent class.
The dataframe I am using looks like this:

data

   title_overlap_quartile sales_rank_quartile rank_grp
0                      Q4                  Q2    GRP 1
1                      Q4                  Q3    GRP 1
2                      Q2                  Q1    GRP 1
3                      Q4                  Q1    GRP 1
5                      Q2                  Q1    GRP 2
I understand that categorical features need to be encoded in scikit-learn using LabelEncoder and/or OneHotEncoder.
First I tried just LabelEncoder, but that does not solve the problem, since DecisionTreeClassifier treats the encoded values as continuous. Then I read in this post: Issue with OneHotEncoder for categorical features that the variable should first be encoded with LabelEncoder and then encoded again with OneHotEncoder.
I tried to implement that on this dataset in the following way, but I am getting an error.

from sklearn import preprocessing

def encode_features(df, columns):
    le = preprocessing.LabelEncoder()
    ohe = preprocessing.OneHotEncoder(sparse=False)
    for i in columns:
        le.fit(df[i].unique())
        df[i + '_le'] = le.transform(df[i])
        df[i + '_le'] = df[i + '_le'].values.reshape(-1, 1)
        df[i + '_le' + '_ohe'] = ohe.fit_transform(df[i + '_le'])
    return df
data = encode_features(data, ['title_overlap_quartile', 'sales_rank_quartile'])
File "/Users/vaga/anaconda2/envs/py36/lib/python3.5/site-packages/pandas/core/series.py", line 2800, in _sanitize_index
raise ValueError('Length of values does not match length of ' 'index')
ValueError: Length of values does not match length of index
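If I read the docs right, OneHotEncoder.fit_transform returns a 2-D array with one column per observed category, so assigning that whole array to a single DataFrame column fails with exactly this length mismatch. A minimal sketch of what the encoder actually returns (assuming a recent scikit-learn, where OneHotEncoder accepts string columns directly and no LabelEncoder step is needed):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'title_overlap_quartile': ['Q4', 'Q4', 'Q2', 'Q4', 'Q2'],
    'sales_rank_quartile':    ['Q2', 'Q3', 'Q1', 'Q1', 'Q1'],
})

ohe = OneHotEncoder()  # returns a sparse matrix by default
encoded = ohe.fit_transform(df[['title_overlap_quartile']]).toarray()

# Two observed categories ('Q2', 'Q4') -> two output columns,
# which cannot fit into one DataFrame column.
print(encoded.shape)
```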
When I remove the ohe part from the function and run it outside, it runs, but the results look weird:
def encode_features(df, columns):
    le = preprocessing.LabelEncoder()
    ohe = preprocessing.OneHotEncoder(sparse=False)
    for i in columns:
        le.fit(df[i].unique())
        df[i + '_le'] = le.transform(df[i])
        # df[i + '_le'] = df[i + '_le'].values.reshape(-1, 1)
        # df[i + '_le' + '_ohe'] = ohe.fit_transform(df[i + '_le'])
    return df
data = encode_features(data, ['title_overlap_quartile', 'sales_rank_quartile'])
data['title_overlap_quartile_le'] = data['title_overlap_quartile_le'].values.reshape(-1, 1)
print(ohe.fit_transform(data['title_overlap_quartile_le']))
[[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
I also tried pandas.get_dummies, which converts the variable into multiple binary-coded columns, but those again get treated as continuous variables by the decision tree classifier.
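For context, get_dummies expands a categorical column into one 0/1 indicator column per level; on my first feature it looks roughly like this:

```python
import pandas as pd

df = pd.DataFrame({'title_overlap_quartile': ['Q4', 'Q4', 'Q2', 'Q4', 'Q2']})

# One indicator column per level, named <prefix>_<level>
dummies = pd.get_dummies(df['title_overlap_quartile'],
                         prefix='title_overlap_quartile')
print(dummies.columns.tolist())
# ['title_overlap_quartile_Q2', 'title_overlap_quartile_Q4']
```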
Can someone please help me with how to fit a decision tree using the categorical variables as categorical and output the tree diagram?
The code for fitting and drawing the tree I am using is:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(data[['title_overlap_quartile', 'sales_rank_quartile']], data[['rank_grp']])
dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=data[['title_overlap_quartile', 'sales_rank_quartile']].columns,
                                filled=True, rounded=True,
                                special_characters=True)
graph = graphviz.Source(dot_data)
graph.render("new_tree")
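For reference, the end-to-end flow I am attempting, sketched on a toy version of my frame with get_dummies standing in for the encoding step (column and group names are from my data above; the Graphviz rendering step is omitted):

```python
import pandas as pd
from sklearn import tree

data = pd.DataFrame({
    'title_overlap_quartile': ['Q4', 'Q4', 'Q2', 'Q4', 'Q2'],
    'sales_rank_quartile':    ['Q2', 'Q3', 'Q1', 'Q1', 'Q1'],
    'rank_grp':               ['GRP 1', 'GRP 1', 'GRP 1', 'GRP 1', 'GRP 2'],
})

# One 0/1 indicator column per (feature, level) pair
X = pd.get_dummies(data[['title_overlap_quartile', 'sales_rank_quartile']])
y = data['rank_grp']

clf = tree.DecisionTreeClassifier()
clf.fit(X, y)

# DOT source for the fitted tree; the indicator column names become node labels
dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=X.columns,
                                filled=True, rounded=True,
                                special_characters=True)
```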