2
votes

I did find threads about scikit-learn and categorical variables, but I could not find an easy answer. I do realise that while building a decision tree, sklearn errors out on categorical data, and there are suggestions for Vectorizer etc. I tried everything, yet I am not able to create a decision tree. My table has a lot of columns with strings, and I tried a vectorizer, MultiLabelBinarizer, etc. Nothing seems to work. I am not able to export_graphviz and display the tree, as there is no tree at all. I am pretty new to this. I sincerely request help in understanding how to handle these columns. I am splitting the data 80-20 into training and test sets, then trying to build a tree. Just a quick piece of code:

  dtree=DecisionTreeClassifier(random_state=0)
  mlb = preprocessing.MultiLabelBinarizer()
  n_train = mlb.fit_transform(train)
  n_test = mlb.transform(test) 
  dec_tree=dtree.fit(n_train,n_test)

I do get this as answer and I am confused:

  DecisionTreeClassifier(class_weight=None, criterion='gini',  
        max_depth=None,
        max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
        min_samples_split=2, min_weight_fraction_leaf=0.0,
        random_state=0, splitter='best')

Please advise on how to proceed.


2 Answers

0
votes

In order to make your categorical variables usable by the classifier, one possibility is to use OneHotEncoder from scikit-learn.

You should watch out that no variable has levels with too few occurrences. If you don't want to, or cannot, check this manually, you can apply a threshold on the variance of the resulting dummy columns with VarianceThreshold.
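A minimal sketch of those two steps, with made-up column names and data (the threshold value is just for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_selection import VarianceThreshold

df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "red", "red"],
    "size":  ["S", "M", "S", "L", "S", "M"],
})

# One-hot encode the string columns; the keyword name changed in 1.2.
try:
    enc = OneHotEncoder(sparse_output=False)  # scikit-learn >= 1.2
except TypeError:
    enc = OneHotEncoder(sparse=False)         # older releases
X = enc.fit_transform(df)

# A dummy column for a level occurring with frequency p has variance
# p*(1-p), so rare levels produce low-variance columns we can drop.
sel = VarianceThreshold(threshold=0.15)
X_reduced = sel.fit_transform(X)
print(X.shape, X_reduced.shape)
```

Here the rare levels ("green", "L", "blue", each appearing once in six rows) fall below the threshold and are removed.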


Another possibility, if you are using a pandas DataFrame as your structure: pandas.get_dummies(DataFrame["variable"]) will build the dummy variables for you.

0
votes

Try this to encode your features. You should pass your label (the column that you want to predict) as the second argument to the dtree.fit() function, but you are passing your test data as the second argument instead. Check this to learn the correct way to use the DecisionTreeClassifier fit function.
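A sketch of the corrected call, with placeholder data and names (train, y_train, and test here are not from the original post): encode the features of both sets, then fit on training features plus training labels.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

train = pd.DataFrame({"color": ["red", "blue", "red", "green"]})
y_train = [0, 1, 0, 1]          # the label column you want to predict
test = pd.DataFrame({"color": ["blue", "red"]})

# One-hot encode features; reindex so test gets the same columns
# as train (levels missing from test become all-zero columns).
X_train = pd.get_dummies(train)
X_test = pd.get_dummies(test).reindex(columns=X_train.columns, fill_value=0)

dtree = DecisionTreeClassifier(random_state=0)
dtree.fit(X_train, y_train)     # features + labels, not features + test data
print(dtree.predict(X_test))
```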