
In attempting to understand how a scikit-learn decision tree behaves for one-hot encoded data, I have the following:

from sklearn import tree

X = [[1, 0, 1], [1, 1, 1]]
Y = [1, 2]

clf = tree.DecisionTreeClassifier(criterion='entropy')
clf = clf.fit(X, Y)

print(clf.predict([[1, 0, 1]]))
print(clf.predict([[1, 1, 1]]))

print(clf.predict_proba([[1, 0, 1]]))
print(clf.predict_proba([[1, 1, 1]]))

This returns:

[1]
[2]
[[ 1.  0.]]
[[ 0.  1.]]

The documentation for predict_proba (http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.predict_proba) states that the following is returned:

p : array of shape = [n_samples, n_classes], or a list of n_outputs such arrays if n_outputs > 1. The class probabilities of the input samples. The order of the classes corresponds to that in the attribute classes_.

Shouldn't the probability of correctness for the given input be returned? How do the return values [[ 1.  0.]] and [[ 0.  1.]] correspond to class probabilities for the input samples?


1 Answer


For instance, clf.predict_proba([[1,0,1]]) gives the following:

[[ 1.  0.]]   # sample 1
#  ^   ^
#  |   probability of class 2
#  probability of class 1

So the prediction says that the probability of the sample [1,0,1] being class 1 is 1, and the probability of it being class 2 is 0. The prediction is therefore class 1, which is exactly what clf.predict([[1,0,1]]) gives you. The probabilities could also be other values, for instance [[0.8, 0.2]]; the class with the largest probability is taken as the predicted value.
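To make the correspondence explicit, here is a short sketch (using the same toy data as the question) showing that the columns of predict_proba follow the order of the classes_ attribute, and that predict simply picks the class whose column has the highest probability:

```python
import numpy as np
from sklearn import tree

# Same toy data as in the question.
X = [[1, 0, 1], [1, 1, 1]]
Y = [1, 2]

clf = tree.DecisionTreeClassifier(criterion='entropy')
clf = clf.fit(X, Y)

proba = clf.predict_proba([[1, 0, 1]])
print(clf.classes_)  # column order of predict_proba, e.g. [1 2]
print(proba)         # one row per sample, one column per class

# predict() returns the class whose column has the largest probability:
predicted = clf.classes_[np.argmax(proba, axis=1)]
print(predicted)
```

With only two training samples, the tree fits the data perfectly, so the probabilities are degenerate (1 and 0); on a larger, noisier dataset, a leaf can contain samples of several classes, and the row would then hold fractional probabilities such as [[0.8, 0.2]].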