
I have the following example code for a simple random forest classifier on the iris dataset, using just 2 decision trees. The code is best run inside a Jupyter notebook.

# Setup
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
import numpy as np
# Set seed for reproducibility
np.random.seed(1015)

# Load the iris data
iris = load_iris()

# Create the train-test datasets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target)

np.random.seed(1039)

# Just fit a simple random forest classifier with 2 decision trees
rf = RandomForestClassifier(n_estimators = 2)
rf.fit(X = X_train, y = y_train)

# Define a function to draw the decision trees in IPython
# Adapted from: http://scikit-learn.org/stable/modules/tree.html
from IPython.display import display, Image
from sklearn import tree
import pydotplus

# Now plot the trees individually
for dtree in rf.estimators_:
    dot_data = tree.export_graphviz(dtree,
                                    out_file=None,
                                    filled=True,
                                    rounded=True,
                                    special_characters=True)
    graph = pydotplus.graph_from_dot_data(dot_data)
    img = Image(graph.create_png())
    display(img)
    # draw_tree(inp_tree=dtree)  # draw_tree is not defined in this snippet; display(img) above already renders the tree
    #print(dtree.tree_.feature)

The output for the first tree is:

[Graphviz rendering of the first decision tree]

As can be observed, the first decision tree has 8 leaf nodes and the second decision tree (not shown) has 6 leaf nodes.

How do I extract a simple numpy array which contains, for each decision tree and each leaf node in the tree:

  • the classification outcome for that leaf node (i.e. the most frequent class it predicts), and
  • a boolean flag for each feature, indicating whether that feature is used on the decision path to that same leaf node?

In the above example we would have:

  • 2 trees - {0, 1}
  • for tree {0} we have 8 leaf nodes indexed {0, 1, ..., 7}
  • for tree {1} we have 6 leaf nodes indexed {0, 1, ..., 5}
  • for each leaf node in each tree we have a single most frequent predicted class i.e. {0, 1, 2} for the iris dataset
  • for each leaf node we have a set of boolean values for the 4 features that were used to build that tree: a feature counts as True if it is used one or more times on the decision path to the leaf node, and False if it is never used on that path.

Any help extracting these numpy arrays inside the above loop is appreciated.
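
For reference, I can see that each fitted estimator exposes the raw tree arrays via its tree_ attribute; a minimal probe of what is available (attribute names from scikit-learn's Tree structure):

# Inspect the raw structure of the first fitted tree
t = rf.estimators_[0].tree_
print(t.node_count)       # total number of nodes
print(t.children_left)    # left child per node; -1 marks a leaf
print(t.children_right)   # right child per node; -1 marks a leaf
print(t.feature)          # feature index tested at each internal node
print(t.value.shape)      # (n_nodes, 1, n_classes): per-node class counts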

Thanks

Have you had a look at the code in the tree class? In particular, I think the code for the export_graphviz function is a good place to start: github.com/scikit-learn/scikit-learn/blob/14031f6/sklearn/tree/… - piman314
when I try to run your code I get name 'draw_tree' is not defined any ideas why ? - user9238790
The decision nodes are accessible in Python, see stackoverflow.com/questions/50600290/… - Jon Nordby

1 Answer


Similar to the question here: how extraction decision rules of random forest in python

You can use the snippet @jonnor provided (I used a modified version of it as well):

import numpy
from sklearn.model_selection import train_test_split
from sklearn import metrics, datasets, ensemble

def print_decision_rules(rf):

    for tree_idx, est in enumerate(rf.estimators_):
        tree = est.tree_
        assert tree.value.shape[1] == 1 # no support for multi-output

        print('TREE: {}'.format(tree_idx))

        iterator = enumerate(zip(tree.children_left, tree.children_right, tree.feature, tree.threshold, tree.value))
        for node_idx, data in iterator:
            left, right, feature, th, value = data

            # left: index of left child (if any)
            # right: index of right child (if any)
            # feature: index of the feature to check
            # th: the threshold to compare against
            # value: per-class counts of training samples at this node

            # for a classifier, the predicted class is the one with the largest count
            class_idx = numpy.argmax(value[0])

            if left == -1 and right == -1:
                print('{} LEAF: return class={}'.format(node_idx, class_idx))
            else:
                print('{} NODE: if feature[{}] <= {} then next={} else next={}'.format(node_idx, feature, th, left, right))


digits = datasets.load_digits()
Xtrain, Xtest, ytrain, ytest = train_test_split(digits.data, digits.target)
estimator = ensemble.RandomForestClassifier(n_estimators=3, max_depth=2)
estimator.fit(Xtrain, ytrain)
print_decision_rules(estimator)
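
To build the exact arrays the question asks for (the per-leaf majority class plus a boolean mask of the features used on the path to each leaf), here is a minimal sketch along the same lines, using only the standard tree_ arrays; the helper name leaf_paths is mine:

import numpy

def leaf_paths(rf):
    # For each tree, return (leaf_classes, leaf_masks) where leaf_classes[i]
    # is the majority class of leaf i and leaf_masks[i, f] is True if
    # feature f is tested anywhere on the path from the root to leaf i.
    results = []
    for est in rf.estimators_:
        t = est.tree_
        leaf_classes, leaf_masks = [], []

        def recurse(node, mask):
            if t.children_left[node] == -1:  # -1 marks a leaf
                leaf_classes.append(int(numpy.argmax(t.value[node][0])))
                leaf_masks.append(mask)
            else:
                new_mask = mask.copy()
                new_mask[t.feature[node]] = True
                recurse(t.children_left[node], new_mask)
                recurse(t.children_right[node], new_mask)

        recurse(0, numpy.zeros(t.n_features, dtype=bool))
        results.append((numpy.array(leaf_classes), numpy.array(leaf_masks)))
    return results

for tree_idx, (classes, masks) in enumerate(leaf_paths(estimator)):
    print('TREE {}: leaf classes = {}'.format(tree_idx, classes))
    print(masks.astype(int))  # one row per leaf, one column per feature

Leaves are enumerated in depth-first order, which gives the {0, ..., n_leaves - 1} indexing per tree that the question describes.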

Another approach and for visualization:

For visualizing the decision path, you can use the dtreeviz library from https://explained.ai/decision-tree-viz/index.html

They have fantastic visualizations, like the example at:

Source https://explained.ai/decision-tree-viz/images/samples/sweets-TD-3-X.svg

Look at their ShadowDecTree implementation to get more information on the decision path. At https://explained.ai/decision-tree-viz/index.html they also provide an example with

shadow_tree = ShadowDecTree(tree_model, X_train, y_train, feature_names, class_names)

Then you could use something like the get_leaf_sample_counts method.
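
A hedged sketch of how that call might look (ShadowDecTree and get_leaf_sample_counts are taken from that page; check the exact signature against your installed dtreeviz version):

# Sample counts per leaf via dtreeviz's shadow tree (API as documented at
# explained.ai; may differ across dtreeviz versions)
shadow_tree = ShadowDecTree(tree_model, X_train, y_train, feature_names, class_names)
print(shadow_tree.get_leaf_sample_counts())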