I am using scikit-learn's RandomForestClassifier to predict multiple labels for documents. Each document has 50 features, no document is missing any features, and each document has at least one label associated with it.
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=20).fit(X_train, y_train)
preds = clf.predict(X_test)
However, I have noticed that after prediction some samples are assigned no labels at all, even though those samples are not missing label data.
>>> y_test[0,:]
array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
>>> preds[0,:]
array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])
The results of predict_proba align with those of predict.
>>> probas = clf.predict_proba(X_test)
>>> for label in probas:
...     print (label[0][0], label[0][1])
(0.80000000000000004, 0.20000000000000001)
(0.94999999999999996, 0.050000000000000003)
(0.94999999999999996, 0.050000000000000003)
(1.0, 0.0)
(1.0, 0.0)
(1.0, 0.0)
(0.94999999999999996, 0.050000000000000003)
(0.90000000000000002, 0.10000000000000001)
(1.0, 0.0)
(1.0, 0.0)
(0.94999999999999996, 0.050000000000000003)
(1.0, 0.0)
(0.94999999999999996, 0.050000000000000003)
(0.84999999999999998, 0.14999999999999999)
(0.90000000000000002, 0.10000000000000001)
(0.90000000000000002, 0.10000000000000001)
(1.0, 0.0)
(0.59999999999999998, 0.40000000000000002)
(0.94999999999999996, 0.050000000000000003)
(0.94999999999999996, 0.050000000000000003)
(1.0, 0.0)
Each row above shows that, for the corresponding label, a higher marginal probability was assigned to the label being absent than to it being present. My understanding of decision trees was that at least one label has to be assigned to each sample when predicting, so this leaves me a bit confused.
Is it expected behavior for a multilabel decision tree / random forest to be able to assign no labels to a sample?
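For reference, the all-zero prediction rows can be counted directly (a quick diagnostic sketch using preds and y_test from above):

# Count test samples whose prediction row is all zeros (no label assigned).
print((preds.sum(axis=1) == 0).sum())

# Every sample has at least one true label, so this should print 0.
print((y_test.sum(axis=1) == 0).sum())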
UPDATE 1
The features of each document are probabilities of belonging to a topic according to a topic model.
>>> X_train.shape
(99892L, 50L)
>>> X_train[3,:]
array([  5.21079651e-01,   1.41085893e-06,   2.55158446e-03,
         5.88421331e-04,   4.17571505e-06,   9.78104112e-03,
         1.14105667e-03,   7.93964896e-04,   7.85177346e-03,
         1.92635026e-03,   5.21080173e-07,   4.04680406e-04,
         2.68261102e-04,   4.60332012e-04,   2.01803955e-03,
         6.73533276e-03,   1.38491129e-03,   1.05682475e-02,
         1.79368409e-02,   3.86488757e-03,   4.46729289e-04,
         8.82488825e-05,   2.09428702e-03,   4.12810745e-02,
         1.81651561e-03,   6.43641626e-03,   1.39687081e-03,
         1.71262909e-03,   2.95181902e-04,   2.73045908e-03,
         4.77474778e-02,   7.56948497e-03,   4.22549636e-03,
         3.78891036e-03,   4.64685435e-03,   6.18710017e-03,
         2.40424583e-02,   7.78131179e-03,   8.14288762e-03,
         1.05162547e-02,   1.83166124e-02,   3.92332202e-03,
         9.83870257e-03,   1.16684231e-02,   2.02723299e-02,
         3.38977762e-03,   2.69966332e-02,   3.43221675e-02,
         2.78571022e-02,   7.11067964e-02])
The label data was formatted using MultiLabelBinarizer and looks like:
>>> y_train.shape
(99892L, 21L)
>>> y_train[3,:]
array([0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
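The binarization was along these lines (a minimal sketch; the raw label lists here are hypothetical placeholders, not my actual data):

from sklearn.preprocessing import MultiLabelBinarizer

# Each document starts with a list of one or more label names (hypothetical examples).
raw_labels = [['sports'], ['politics', 'economy'], ['sports', 'culture']]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(raw_labels)  # (n_samples, n_labels) 0/1 indicator matrix
print(mlb.classes_)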
UPDATE 2
The output of predict_proba above suggested that the assignment of no labels might be an artifact of the trees voting on labels (there are 20 trees, and all probabilities are approximately multiples of 0.05). However, using a single decision tree, I still find some samples that are assigned no labels. The output looks similar to the predict_proba output above, in that for each sample there is a probability that a given label is or is not assigned to the sample. This seems to suggest that at some point the decision tree is turning the problem into binary classification, though the documentation says that the tree takes advantage of label correlations.
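The single-tree check amounts to the following (a sketch of the experiment; same train/test split as above):

from sklearn.tree import DecisionTreeClassifier

# A single tree instead of a forest, on the same data.
dt = DecisionTreeClassifier().fit(X_train, y_train)
dt_preds = dt.predict(X_test)

# Even with one tree, some samples get an all-zero prediction row.
print((dt_preds.sum(axis=1) == 0).sum())

# predict_proba still returns one (n_samples, 2) array per label,
# i.e. per-label binary probabilities, mirroring the forest's output above.
dt_probas = dt.predict_proba(X_test)
print(len(dt_probas), dt_probas[0].shape)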