I have a very specific question regarding random forests and their implementation in scikit-learn. I constructed a forest, and prediction works fine so far. However, I need to know which particular features end up in the leaves of the individual trees after fitting the forest.

I already understand the hierarchical structure: the random forest builds individual decision trees, which I can access through the "estimators_" attribute, and each decision tree in turn has a "tree_" attribute. In this tree I find information about the specific nodes, leaves and splits, but I am unable to find out which specific features end up in a leaf, or even which features are assigned to that specific tree.

Just to clarify: I do not intend to analyze tree nodes manually. I want a similarity measure based on how frequently a certain set of features ends up together in the leaves of a tree.

Thank you very much!

P.S. Just in case anyone is interested, I got the idea for this similarity measure from this paper: http://www.sciencedirect.com/science/article/pii/S1053811912009834

Edit: OK, I have got further. Apparently there is a method for the ensemble that "Finds the terminal region (=leaf node) for each sample in X." It can be found here (line 2465): https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx

I have already fiddled around with it, but I can't fully understand the code. I can call rf.apply(np.array([i,j])), where i and j are the indices of samples in the 2D matrix used to fit the model, and it returns the leaves in which they are situated. Great! The only strange thing is that I can use numbers far greater than my actual number of samples. Does it assume the last sample if the index is way too big, or am I doing something wrong?

For test purposes I made a 10x5 matrix, but I can call rf.apply(np.array([500,2333])) without an error.

Sorry, but I'm not entirely sure how to feed data to this function.
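A minimal sketch of one way to feed data to apply, assuming a recent scikit-learn release, where the forest's apply method takes a 2D array of shape (n_samples, n_features), i.e. the feature values themselves rather than row indices. That would plausibly explain why large numbers are accepted: they would be read as feature values. The toy data, n_estimators=20, and the names leaves and proximity are illustrative assumptions, not taken from the paper or the post above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data (made up for illustration): 10 samples, 5 features.
rng = np.random.RandomState(0)
X = rng.rand(10, 5)
y = np.array([0, 1] * 5)

rf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# apply() expects feature rows, not sample indices.
# Result: one leaf id per (sample, tree), shape (n_samples, n_estimators).
leaves = rf.apply(X)

# Sketch of a co-occurrence similarity: the fraction of trees in which
# two samples land in the same leaf.
same_leaf = leaves[:, None, :] == leaves[None, :, :]
proximity = same_leaf.mean(axis=2)  # shape (n_samples, n_samples)

print(leaves.shape)
print(proximity.round(2))
```

To look up a single training sample by its row number i, one would pass X[[i]] (a 2D slice) rather than the number i itself.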

1 Answer

You can pull out the decision nodes in Python, as explained in "how extraction decision rules of random forest in python". You can then traverse the trees to check which features led to each leaf.
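The traversal could be sketched roughly as follows, using the children_left, children_right and feature arrays that scikit-learn exposes on tree_. The helper name features_per_leaf and the iris data are assumptions made for illustration; this is not the code from the linked question.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

def features_per_leaf(decision_tree):
    """Map each leaf node id to the set of feature indices tested on its path.

    Hypothetical helper, not part of scikit-learn.
    """
    t = decision_tree.tree_
    result = {}
    stack = [(0, frozenset())]  # (node id, features seen so far); node 0 is the root
    while stack:
        node, seen = stack.pop()
        left = int(t.children_left[node])
        right = int(t.children_right[node])
        if left == right:  # leaf: both children are -1
            result[node] = seen
        else:
            seen_here = seen | {int(t.feature[node])}
            stack.append((left, seen_here))
            stack.append((right, seen_here))
    return result

# One dict per tree in the forest: leaf id -> features used to reach it.
leaf_features = [features_per_leaf(est) for est in rf.estimators_]

for leaf, feats in sorted(leaf_features[0].items()):
    print(leaf, sorted(feats))
```

The leaf ids used as keys here are the same node ids that apply returns, so the two can be joined to see which features led to the leaf a given sample falls into.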