I have a very specific question regarding random forests and their implementation in scikit-learn. I constructed a forest, and prediction works fine so far. However, I need to know which particular features end up in the leaves of the individual trees after fitting the forest.
I already understand the hierarchical structure: the random forest builds individual decision trees, which I can access through the "estimators_" attribute, and each decision tree in turn has a "tree_" attribute. In this tree I find information about the specific nodes, leaves and splits, but I am unable to find out which specific features end up in a leaf, or even which features are assigned to that specific tree.
Just to clarify: I do not intend to analyze tree nodes manually. I want a similarity measure based on how frequently a certain set of features ends up together in the leaves of a tree.
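To make the structure concrete, here is a minimal sketch of how I walk from the forest down to the per-tree arrays. The toy data, the variable names and the choice of 3 trees are made up for illustration. It relies on the "tree_" arrays "feature" and "children_left", where leaves are marked with a child index of -1, and it only lists the features each tree actually splits on, which is not quite the per-leaf information I am after.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data (hypothetical): 10 samples, 5 features, 2 classes
rng = np.random.RandomState(0)
X = rng.rand(10, 5)
y = np.array([0, 1] * 5)

rf = RandomForestClassifier(n_estimators=3, random_state=0).fit(X, y)

features_per_tree = []
for i, est in enumerate(rf.estimators_):
    t = est.tree_                       # low-level tree structure
    is_leaf = t.children_left == -1     # leaf nodes have no children
    # t.feature holds the split feature index for each internal node
    split_features = np.unique(t.feature[~is_leaf])
    features_per_tree.append(split_features)
    print("tree", i, "nodes:", t.node_count,
          "leaves:", int(is_leaf.sum()),
          "features used in splits:", split_features)
```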
Thank you very much!
P.S. In case anyone is interested, I got the idea for this similarity measure from this paper: http://www.sciencedirect.com/science/article/pii/S1053811912009834
Edit: OK, I have got further. Apparently there is a method for the ensemble that "Finds the terminal region (=leaf node) for each sample in X." It can be found here (line 2465): https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx
I have already fiddled around with it, but I can't fully understand the code. I can call rf.apply(np.array([i,j])), where i and j are row numbers of samples in the 2D matrix used to fit the model, and it returns the leaves in which they are situated. Great! The only strange thing is that I can use numbers far greater than my actual number of samples. Does it assume the last sample if the index is too big, or am I doing something wrong?
For test purposes I made a 10x5 matrix, but I can call rf.apply(np.array([500,2333])) without an error.
Sorry, but I'm not entirely sure how to feed data to this function.
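For reference, here is a minimal sketch of what I am trying to get to. It assumes that apply takes a 2D matrix of feature values with shape (n_samples, n_features), as the documented signature suggests, rather than row indices. It also assumes that "ending up in a leaf together" refers to samples (rows of X). The toy data and all variable names are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data (hypothetical): 10 samples, 5 features, 2 classes
rng = np.random.RandomState(0)
X = rng.rand(10, 5)
y = np.array([0, 1] * 5)

rf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# Pass feature rows, not row numbers: one leaf index per sample and tree
leaves = rf.apply(X)                 # shape (n_samples, n_estimators)

# To look up specific samples, select their rows first
leaves_2_and_7 = rf.apply(X[[2, 7]])

# Similarity: fraction of trees in which two samples share a leaf
same_leaf = leaves[:, None, :] == leaves[None, :, :]
proximity = same_leaf.mean(axis=2)   # shape (n_samples, n_samples)

print(leaves.shape, proximity.shape)
```

If that assumption is right, np.array([500,2333]) would be read as feature values instead of sample numbers, which might explain why large numbers do not raise an error.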