I built a random forest in H2O using R.

I want to know how many leaves the forest has in total, so that I can compare the number of leaves to my total number of rows.

I have 200 trees, 8 layers deep, and require at least 5 rows per terminal leaf. I have 20k rows. Am I covering much of my data?

Is there a clean way to count the number of leaves in the h2o randomForest?

A random forest model is actually a collection of trees. Why do you need to compute the number of leaves? - Tim Biegeleisen
I want to use the ratio of leaves to rows to think about over-fitting or under-fitting in terms of variable importance. If I use each row 12 times, then there is likely a problem. If I use less than 0.5% of rows, then I could be under-sampling, even if it does look bootstrappy. Perhaps my trees use the same row 12 times, or 0.1 times. I want to think in terms of coverage, and don't know how to access the actual 'h2o' leaves. - EngrStudent
This approach is not right AFAIK. The variables which are used to split early on usually dominate the behavior of a given tree. If you want variable importance, you should be using importance() or for a simpler option try the best constant model. - Tim Biegeleisen
@TimBiegeleisen - I use "importance", aggressively. I have enjoyed getting into the fundamentals of the 'Boruta' package including the difference between z and gini. That is not the problem. My problem is that in some cases I have sparse data, and the tree looks like a subset of rows are dominating the fit. In some cases I have huge data and I want to be sure it is adequately sampled. All the "importance" in the world is going to be meaningless if I don't make a healthy forest. I want some better measure than an eyeball norm on a convergence plot. - EngrStudent
What is a "row" in the context of random forests? - Tim Biegeleisen

1 Answer

You'll be glad to know H2O stores that information! For example, using the Iris data set:

m <- h2o.randomForest(x = 1:4, y = 5, training_frame = data)  # data: iris loaded as an H2OFrame

When printing m, the model summary shows:

number_of_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
            150               20217         1         9    3.72667          2         15     6.17333

So there are about 926 leaves in total (mean_leaves * number_of_trees = 6.17333 * 150).

To get it with code:

ms <- m@model$model_summary             # summary table stored on the model
ms$number_of_trees * ms$mean_leaves     # total number of leaves in the forest
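
To turn that into the leaves-to-rows comparison the question asks about, something like the following might work. It is only a sketch: leaf_coverage is a hypothetical helper (not part of h2o), and the data frame ms stands in for m@model$model_summary using the values printed above, so the snippet runs without a cluster. The upper bound assumes binary splits, so a tree of depth d has at most 2^d leaves.

```r
# Hypothetical helper (not an h2o function): summarise leaf counts
# relative to the number of training rows.
leaf_coverage <- function(model_summary, n_rows) {
  total_leaves <- model_summary$number_of_trees * model_summary$mean_leaves
  list(
    total_leaves    = total_leaves,
    leaves_per_tree = model_summary$mean_leaves,
    leaves_per_row  = total_leaves / n_rows
  )
}

# Stand-in for m@model$model_summary, copied from the printed summary above.
ms <- data.frame(number_of_trees = 150, mean_leaves = 6.17333)

# Iris has 150 rows.
cov <- leaf_coverage(ms, n_rows = 150)
round(cov$total_leaves)   # total leaves, rounded
cov$leaves_per_row        # leaves per training row

# Rough upper bound for the forest in the question, assuming binary splits:
# 200 trees, each with at most 2^8 leaves at depth 8.
max_leaves_question <- 200 * 2^8
max_leaves_question / 20000   # at most this many leaves per row
```

With a real model you would pass m@model$model_summary and nrow(data) instead of the stand-in values.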