I am using scikit-learn's RandomForestClassifier to predict multiple labels for documents. Each document has 50 features, no document is missing any features, and each document has at least one label associated with it.
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=20).fit(X_train, y_train)
preds = clf.predict(X_test)
However, I have noticed that after prediction some samples are assigned no labels at all, even though those samples are not missing label data.
>>> y_test[0,:]
array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
>>> preds[0,:]
array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])
The results of predict_proba align with those of predict.
>>> probas = clf.predict_proba(X_test)
>>> for label in probas:
...     print (label[0][0], label[0][1])
(0.80000000000000004, 0.20000000000000001)
(0.94999999999999996, 0.050000000000000003)
(0.94999999999999996, 0.050000000000000003)
(1.0, 0.0)
(1.0, 0.0)
(1.0, 0.0)
(0.94999999999999996, 0.050000000000000003)
(0.90000000000000002, 0.10000000000000001)
(1.0, 0.0)
(1.0, 0.0)
(0.94999999999999996, 0.050000000000000003)
(1.0, 0.0)
(0.94999999999999996, 0.050000000000000003)
(0.84999999999999998, 0.14999999999999999)
(0.90000000000000002, 0.10000000000000001)
(0.90000000000000002, 0.10000000000000001)
(1.0, 0.0)
(0.59999999999999998, 0.40000000000000002)
(0.94999999999999996, 0.050000000000000003)
(0.94999999999999996, 0.050000000000000003)
(1.0, 0.0)
Each row above shows that, for the corresponding label, a higher marginal probability was assigned to the label being absent than to it being present. My understanding of decision trees was that at least one label has to be assigned to each sample when predicting, so this leaves me a bit confused.
Is it expected behavior for a multilabel decision tree / random forest to be able to assign no labels to a sample?
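For reference, the all-zero prediction rows can be counted directly (a quick diagnostic sketch using preds and y_test from above):

# Count test samples whose prediction row is all zeros (no label assigned).
print((preds.sum(axis=1) == 0).sum())

# Every sample has at least one true label, so this should print 0.
print((y_test.sum(axis=1) == 0).sum())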
UPDATE 1
The features of each document are probabilities of belonging to a topic according to a topic model.
>>> X_train.shape
(99892L, 50L)
>>> X_train[3,:]
array([  5.21079651e-01,   1.41085893e-06,   2.55158446e-03,
         5.88421331e-04,   4.17571505e-06,   9.78104112e-03,
         1.14105667e-03,   7.93964896e-04,   7.85177346e-03,
         1.92635026e-03,   5.21080173e-07,   4.04680406e-04,
         2.68261102e-04,   4.60332012e-04,   2.01803955e-03,
         6.73533276e-03,   1.38491129e-03,   1.05682475e-02,
         1.79368409e-02,   3.86488757e-03,   4.46729289e-04,
         8.82488825e-05,   2.09428702e-03,   4.12810745e-02,
         1.81651561e-03,   6.43641626e-03,   1.39687081e-03,
         1.71262909e-03,   2.95181902e-04,   2.73045908e-03,
         4.77474778e-02,   7.56948497e-03,   4.22549636e-03,
         3.78891036e-03,   4.64685435e-03,   6.18710017e-03,
         2.40424583e-02,   7.78131179e-03,   8.14288762e-03,
         1.05162547e-02,   1.83166124e-02,   3.92332202e-03,
         9.83870257e-03,   1.16684231e-02,   2.02723299e-02,
         3.38977762e-03,   2.69966332e-02,   3.43221675e-02,
         2.78571022e-02,   7.11067964e-02])
The label data was formatted using MultiLabelBinarizer and looks like:
>>> y_train.shape
(99892L, 21L)
>>> y_train[3,:]
array([0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
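The binarization was along these lines (a minimal sketch; the raw label lists here are hypothetical placeholders, not my actual data):

from sklearn.preprocessing import MultiLabelBinarizer

# Each document starts with a list of one or more label names (hypothetical examples).
raw_labels = [['sports'], ['politics', 'economy'], ['sports', 'culture']]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(raw_labels)  # (n_samples, n_labels) 0/1 indicator matrix
print(mlb.classes_)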
UPDATE 2
The output of predict_proba above suggested that the assignment of no labels might be an artifact of the trees voting on labels (there are 20 trees, and all probabilities are approximately multiples of 0.05). However, using a single decision tree, I still find some samples that are assigned no labels. The output looks similar to the predict_proba output above, in that for each sample there is a probability that a given label is or is not assigned to the sample. This seems to suggest that at some point the decision tree is turning the problem into binary classification, though the documentation says that the tree takes advantage of label correlations.
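The single-tree check amounts to the following (a sketch of the experiment; same train/test split as above):

from sklearn.tree import DecisionTreeClassifier

# A single tree instead of a forest, on the same data.
dt = DecisionTreeClassifier().fit(X_train, y_train)
dt_preds = dt.predict(X_test)

# Even with one tree, some samples get an all-zero prediction row.
print((dt_preds.sum(axis=1) == 0).sum())

# predict_proba still returns one (n_samples, 2) array per label,
# i.e. per-label binary probabilities, mirroring the forest's output above.
dt_probas = dt.predict_proba(X_test)
print(len(dt_probas), dt_probas[0].shape)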