0
votes

Hi, this is a purely theoretical question that I can't get my head around (and I could be completely wrong).

With random forest regression, you grow n trees. Each tree uses a subset of the data and, in some cases, a subset of the available variables to predict the dependent variable. The average of the n trees' predictions is taken as the predicted value. However, is there any need to look at the distribution of predictions at the individual tree level? Can we obtain a number that gives some measure of certainty about the overall predicted value? I would assume that consistent predictions across the individual trees would be preferable to widely varying ones.
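To make the question concrete, here is a minimal sketch of what "looking at the distribution of per-tree predictions" could mean. It is not a real random forest: it assumes one-dimensional inputs, uses one-split trees (stumps) fitted on bootstrap samples, and all names (`fit_stump`, `predict_stump`, `bagged_predictions`) and the synthetic data are made up for illustration.

```python
import random
import statistics


def fit_stump(xs, ys):
    """Fit a one-split regression tree on 1-D data by minimising squared error."""
    best = None
    # Skipping the smallest unique x guarantees both sides of the split are non-empty.
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean, right_mean = statistics.fmean(left), statistics.fmean(right)
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # Every x in this bootstrap sample is identical: predict the overall mean.
        overall = statistics.fmean(ys)
        return (float("inf"), overall, overall)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x < threshold else right_mean


def bagged_predictions(xs, ys, x_new, n_trees=200, seed=0):
    """Return one prediction per tree, each tree fitted on a bootstrap sample."""
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        stump = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(predict_stump(stump, x_new))
    return preds


# Synthetic data (an assumption for the example): y = 2x + Gaussian noise.
data_rng = random.Random(42)
xs = [data_rng.uniform(0, 10) for _ in range(100)]
ys = [2.0 * x + data_rng.gauss(0, 1) for x in xs]

preds = bagged_predictions(xs, ys, x_new=5.0)
forest_mean = statistics.fmean(preds)      # the usual ensemble prediction
tree_sd = statistics.stdev(preds)          # spread of the individual trees
cuts = statistics.quantiles(preds, n=20)   # 19 cut points at 5%, 10%, ..., 95%
lower, upper = cuts[0], cuts[-1]           # 5th and 95th percentiles

print(f"mean={forest_mean:.2f}  sd={tree_sd:.2f}  5%-95%=({lower:.2f}, {upper:.2f})")
```

A small `tree_sd` means the trees agree with each other at that point, but it only describes variation among the trees. It is not a calibrated prediction interval, since it ignores the noise in the dependent variable itself. With scikit-learn, the same per-tree predictions can be collected from the fitted trees in a `RandomForestRegressor`'s `estimators_` attribute.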

Thanks in advance


1 Answer

0
votes

This method of determining variable importance has some drawbacks. For data that include categorical variables with different numbers of levels, random forests are biased in favor of the attributes with more levels. Methods such as partial permutations and growing unbiased trees can be used to address this bias. If the data contain groups of correlated features of similar relevance to the output, then smaller groups are favored over larger groups.