I am new to ML in Python and confused about how to fit a decision tree on categorical variables. In R, party and ctree encode them automatically.

I want to make a decision tree with two categorical independent features and one dependent class.

The dataframe I am using looks like this:

data
      title_overlap_quartile sales_rank_quartile rank_grp
    0                     Q4                  Q2    GRP 1
    1                     Q4                  Q3    GRP 1
    2                     Q2                  Q1    GRP 1
    3                     Q4                  Q1    GRP 1
    5                     Q2                  Q1    GRP 2

I understand that categorical features need to be encoded in scikit-learn using LabelEncoder and/or OneHotEncoder.

First I tried using only LabelEncoder, but that does not solve the problem, since DecisionTreeClassifier then treats the encoded variables as continuous. Then I read in this post, "Issue with OneHotEncoder for categorical features", that the variable should first be encoded with LabelEncoder and then encoded again with OneHotEncoder.

I tried to implement that on this dataset as follows, but I am getting an error:

def encode_features(df, columns):
    le = preprocessing.LabelEncoder()
    ohe = preprocessing.OneHotEncoder(sparse=False)
    for i in columns:
        le.fit(df[i].unique())
        df[i+'_le'] = le.transform(df[i])
        df[i+'_le'] = df[i+'_le'].values.reshape(-1, 1)
        df[i+'_le'+'_ohe'] = ohe.fit_transform(df[i+'_le'])
    return(df)

data = encode_features(data, ['title_overlap_quartile', 'sales_rank_quartile'])


  File "/Users/vaga/anaconda2/envs/py36/lib/python3.5/site-packages/pandas/core/series.py", line 2800, in _sanitize_index
    raise ValueError('Length of values does not match length of ' 'index')

ValueError: Length of values does not match length of index

When I remove the OneHotEncoder part from the function and run it outside, it runs, but the results look weird:

def encode_features(df, columns):
    le = preprocessing.LabelEncoder()
    ohe = preprocessing.OneHotEncoder(sparse=False)
    for i in columns:
        le.fit(df[i].unique())
        df[i+'_le'] = le.transform(df[i])
        # df[i+'_le'] = df[i+'_le'].values.reshape(-1, 1)
        # df[i+'_le'+'_ohe'] = ohe.fit_transform(df[i+'_le'])
    return(df)

data = encode_features(data, ['title_overlap_quartile', 'sales_rank_quartile']) 

data['title_overlap_quartile_le'] = data['title_overlap_quartile_le'].values.reshape(-1, 1)

print(ohe.fit_transform(data['title_overlap_quartile_le']))

[[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
   1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
   1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
   1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]]

I also tried pandas.get_dummies, which converts each variable into multiple binary-coded columns, but those columns again get treated as continuous variables by the decision tree classifier.

Can someone please help me with how to fit a decision tree using the categorical variables as categorical and output the tree diagram?

The code for fitting and drawing the tree I am using is:

clf = tree.DecisionTreeClassifier()
clf = clf.fit(data[['title_overlap_score', 'sales_rank_quartile']], data[['rank_grp']])

dot_data = tree.export_graphviz(clf, out_file=None, feature_names=data[['title_overlap_score', 'sales_rank_quartile']].columns,  
                         filled=True, rounded=True,  
                         special_characters=True)  

graph = graphviz.Source(dot_data)  
graph.render("new_tree")

1 Answer

Although decision trees can in principle handle categorical variables, sklearn's implementation currently cannot, due to this unresolved bug. The current workaround, which is somewhat convoluted, is to one-hot encode the categorical variables before passing them to the classifier.
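
As a rough illustration of the one-hot workaround, here is a minimal sketch that rebuilds the five sample rows from the question (the frame below is reconstructed by hand, so treat it as toy data) and encodes them with pandas.get_dummies. The tree will still report thresholds such as "<= 0.5" on each indicator column, but on a 0/1 column that split simply means "is this level or not", so the columns being handled as numeric is harmless here.

```python
import pandas as pd
from sklearn import tree

# Toy data copied by hand from the sample rows in the question.
data = pd.DataFrame({
    "title_overlap_quartile": ["Q4", "Q4", "Q2", "Q4", "Q2"],
    "sales_rank_quartile": ["Q2", "Q3", "Q1", "Q1", "Q1"],
    "rank_grp": ["GRP 1", "GRP 1", "GRP 1", "GRP 1", "GRP 2"],
})

feature_cols = ["title_overlap_quartile", "sales_rank_quartile"]

# One 0/1 indicator column per (feature, level) pair,
# e.g. "title_overlap_quartile_Q4".
X = pd.get_dummies(data[feature_cols], dtype=int)
y = data["rank_grp"]

clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# DOT source for the fitted tree; pass it to graphviz.Source(...)
# as in the question if the graphviz package is installed.
dot_data = tree.export_graphviz(
    clf,
    out_file=None,
    feature_names=list(X.columns),
    class_names=list(clf.classes_),
    filled=True,
    rounded=True,
    special_characters=True,
)
```

Note that the feature names passed to export_graphviz are the encoded column names, not the two original column names, because the classifier only ever sees the indicator columns.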

Have you tried category_encoders? It is easier to use and can also be used within pipelines.

The latest, yet-to-be-released version of scikit-learn seems to allow string column types without conversion to int.
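
If you are on a version where OneHotEncoder accepts string columns directly (I believe this is scikit-learn 0.20 or later, but check your install), the LabelEncoder step can be dropped entirely. The sketch below is only an illustration on hand-copied toy rows. It also shows why the original function failed: the encoder returns a 2-D array with one column per level, which cannot be assigned to a single DataFrame column, so it has to be wrapped in its own frame.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data copied by hand from the sample rows in the question.
data = pd.DataFrame({
    "title_overlap_quartile": ["Q4", "Q4", "Q2", "Q4", "Q2"],
    "sales_rank_quartile": ["Q2", "Q3", "Q1", "Q1", "Q1"],
    "rank_grp": ["GRP 1", "GRP 1", "GRP 1", "GRP 1", "GRP 2"],
})

feature_cols = ["title_overlap_quartile", "sales_rank_quartile"]

# Fit on the raw string columns; no integer pre-encoding needed.
enc = OneHotEncoder(handle_unknown="ignore")
encoded = enc.fit_transform(data[feature_cols]).toarray()

# Build readable column names from the learned levels.
names = [
    f"{col}_{level}"
    for col, levels in zip(feature_cols, enc.categories_)
    for level in levels
]

# The result has several columns per input feature, so it goes
# into a new DataFrame rather than into one column of `data`.
X = pd.DataFrame(encoded, columns=names, index=data.index)
```

X can then be passed to DecisionTreeClassifier.fit together with data["rank_grp"].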