2
votes

I did find threads about scikit-learn and categorical variables, but I could not find an easy answer. I do realise that while building a decision tree, sklearn errors out on categorical data, and there are suggestions for Vectorizer etc. I tried everything, yet I am not able to create a decision tree. My table has a lot of columns with strings, and I tried a vectorizer, MultiLabelBinarizer, etc. Nothing seems to work. I am not able to export_graphviz and display the tree, as there is no tree at all. I am pretty new to this. I sincerely request help in understanding how to handle these columns. I am splitting the data 80-20 into training and test sets, then trying to build a tree. Just a quick piece of code:

  dtree=DecisionTreeClassifier(random_state=0)
  mlb = preprocessing.MultiLabelBinarizer()
  n_train = mlb.fit_transform(train)
  n_test = mlb.transform(test) 
  dec_tree=dtree.fit(n_train,n_test)

I do get this as answer and I am confused:

  DecisionTreeClassifier(class_weight=None, criterion='gini',  
        max_depth=None,
        max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
        min_samples_split=2, min_weight_fraction_leaf=0.0,
        random_state=0, splitter='best')

Please advise on how to proceed.


2 Answers

0
votes

In order to make your categorical variables usable by the classifier, one possibility is to use OneHotEncoder from scikit-learn.

You should watch out that no variable has levels with too few occurrences. If you don't want to, or cannot, check this manually, you can apply a threshold on the variance of the resulting dummy columns with VarianceThreshold.
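A minimal sketch of those two steps, with made-up column names and data (the threshold value is just for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_selection import VarianceThreshold

df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "red", "red"],
    "size":  ["S", "M", "S", "L", "S", "M"],
})

# One-hot encode the string columns; the keyword name changed in 1.2.
try:
    enc = OneHotEncoder(sparse_output=False)  # scikit-learn >= 1.2
except TypeError:
    enc = OneHotEncoder(sparse=False)         # older releases
X = enc.fit_transform(df)

# A dummy column for a level occurring with frequency p has variance
# p*(1-p), so rare levels produce low-variance columns we can drop.
sel = VarianceThreshold(threshold=0.15)
X_reduced = sel.fit_transform(X)
print(X.shape, X_reduced.shape)
```

Here the rare levels ("green", "L", "blue", each appearing once in six rows) fall below the threshold and are removed.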


Another possibility, if you are using a pandas DataFrame as your structure: pandas.get_dummies(DataFrame["variable"]) will build the dummy variables for you.

0
votes

Try this to encode your features. You should pass your label (the column that you want to predict) as the second argument to the dtree.fit() function, but you are passing your test data as the second argument instead. Check this to learn the correct way to use the DecisionTreeClassifier fit function.
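A sketch of the corrected call, with placeholder data and names (train, y_train, and test here are not from the original post): encode the features of both sets, then fit on training features plus training labels.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

train = pd.DataFrame({"color": ["red", "blue", "red", "green"]})
y_train = [0, 1, 0, 1]          # the label column you want to predict
test = pd.DataFrame({"color": ["blue", "red"]})

# One-hot encode features; reindex so test gets the same columns
# as train (levels missing from test become all-zero columns).
X_train = pd.get_dummies(train)
X_test = pd.get_dummies(test).reindex(columns=X_train.columns, fill_value=0)

dtree = DecisionTreeClassifier(random_state=0)
dtree.fit(X_train, y_train)     # features + labels, not features + test data
print(dtree.predict(X_test))
```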