
I'm working on a Random Forest implementation (for classification), and I have some questions about the tree-growing algorithm described in the literature.

When training a decision tree, there are two criteria for stopping growth:
a. Stop when there are no features left to split a node on.
b. Stop when all samples in a node belong to the same class.

Based on that,
1. Consider growing one tree in the forest. When splitting a node, I randomly select m of the M total features, and from those m features I pick the one with maximum information gain. After I've found this feature, say f, should I remove it from the feature list before proceeding to the node's children? If I don't remove it, the same feature might get selected again further down the tree.
If I implement the algorithm without removing the feature selected at a node, then the only way to stop growing the tree is when the leaves become "pure". When I did this, I hit Python's "maximum recursion depth exceeded" error, because the tree couldn't reach that "pure" condition early enough.
The RF literature, even the papers written by Breiman, says that each tree should be grown to maximum depth. What does this mean?
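For reference, here is a rough sketch of the procedure I'm describing. It is not my actual code: entropy, info_gain and grow_tree are placeholder names, X and y are assumed to be NumPy arrays (y holding non-negative integer class labels), and the optional min_samples/max_depth guards are there only because of the recursion problem mentioned above.

    import numpy as np
    from collections import Counter

    def entropy(y):
        counts = np.bincount(y)
        p = counts[counts > 0] / len(y)
        return -np.sum(p * np.log2(p))

    def info_gain(y, left_mask):
        # Information gain of splitting y into y[left_mask] / y[~left_mask].
        n, n_left = len(y), left_mask.sum()
        if n_left == 0 or n_left == n:
            return -np.inf  # degenerate split, never preferred
        return (entropy(y)
                - (n_left / n) * entropy(y[left_mask])
                - ((n - n_left) / n) * entropy(y[~left_mask]))

    def grow_tree(X, y, m, depth=0, min_samples=1, max_depth=None, rng=np.random):
        # Stop on a pure or tiny node; the depth cap is an optional safety net.
        if len(set(y)) == 1 or len(y) <= min_samples or \
           (max_depth is not None and depth >= max_depth):
            return ('leaf', Counter(y).most_common(1)[0][0])

        # m of the M features, re-drawn independently at every node.
        candidates = rng.choice(X.shape[1], size=m, replace=False)

        # Best (feature, threshold) pair among the candidates, by information gain.
        pairs = [(f, t) for f in candidates for t in np.unique(X[:, f])[:-1]]
        if not pairs:  # every candidate feature is constant in this node
            return ('leaf', Counter(y).most_common(1)[0][0])
        f, t = max(pairs, key=lambda ft: info_gain(y, X[:, ft[0]] <= ft[1]))

        left = X[:, f] <= t
        # Feature f is NOT removed, so it stays available for the children.
        return ('node', f, t,
                grow_tree(X[left], y[left], m, depth + 1, min_samples, max_depth, rng),
                grow_tree(X[~left], y[~left], m, depth + 1, min_samples, max_depth, rng))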

2. At a node, after selecting the best feature to split on (by information gain), what should the split threshold be? One approach is to use no threshold and create one child node for every unique value of the feature; but I have continuous-valued features too, so that would mean creating one child node per sample!
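The binary-split alternative I'm considering for continuous features would look roughly like this (again only a sketch; info_gain is the same placeholder helper as above):

    import numpy as np

    def best_threshold(x, y):
        # Candidate thresholds: midpoints between consecutive sorted unique
        # values, so the node splits into exactly two children.
        values = np.unique(x)
        if len(values) < 2:
            return None  # feature is constant within this node
        midpoints = (values[:-1] + values[1:]) / 2.0
        return max(midpoints, key=lambda t: info_gain(y, x <= t))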

I doubt you're hitting the recursion depth limit without a bug in your code. Since the depth of the tree should be roughly log(N), reaching a depth of thousands would require an absurdly large number of samples. - John Greenall
A tree of depth 20 is already quite complex - that could be as many as 2**20 ≈ 1M leaves; something is broken if you have more leaves than data points. - John Greenall
@JohnGreenall Yeah, there may be a bug. On the other hand, if I remove features that have already been selected for a split as I go down, I don't get this problem, because eventually I run out of features to split on. But what is the standard RF behaviour here? Should I remove features, or is it okay to keep them and let them be repeated down the tree? - sanjeev mk
features can definitely be repeated down the tree - John Greenall
Extremely Randomized Trees (Extra-Trees) is one of the most popular flavours of the RF algorithm, and all your questions are answered in the paper: split on information gain and, for continuous features, select the best of N randomly drawn thresholds. montefiore.ulg.ac.be/~ernst/uploads/news/id63/… - John Greenall
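For concreteness, the Extra-Trees cut-point rule mentioned in the last comment (one random threshold per candidate feature, keep the best candidate overall) can be sketched roughly as follows, with info_gain again standing in for an information-gain helper:

    import numpy as np

    def extra_trees_split(X, y, candidate_features, rng=np.random):
        # One random cut-point per candidate feature; keep the best split overall.
        best_gain, best = -np.inf, None
        for f in candidate_features:
            lo, hi = X[:, f].min(), X[:, f].max()
            if lo == hi:
                continue  # feature is constant in this node
            t = rng.uniform(lo, hi)
            gain = info_gain(y, X[:, f] <= t)
            if gain > best_gain:
                best_gain, best = gain, (f, t)
        return best  # (feature, threshold), or None if all candidates were constant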

1 Answer


Q1

  • You shouldn't remove the selected feature from the feature set. Otherwise the tree will not be able to capture some types of relationships (e.g. a linear dependence on a continuous feature, which requires splitting on that same feature at several thresholds along one path).
  • You can also stop earlier: with your current condition the recursion may go all the way down to leaves containing only 1 sample, which has no statistical significance. It's better to stop when, say, the number of samples at a leaf is <= 3 (see the snippet below).
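i.e. something like this at the top of your recursive grow function (the threshold of 3 is only an example):

    def should_stop(y, min_samples_leaf=3):
        # Stop when the node is pure or has too few samples to be meaningful.
        return len(set(y)) == 1 or len(y) <= min_samples_leaf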

Q2

  • For continuous features, you can bin the values into groups and use the bin edges as candidate splitting points (rough sketch below).
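For example, using quantile bins so that the interior bin edges serve as candidate thresholds (info_gain being your information-gain helper):

    import numpy as np

    def binned_threshold(x, y, n_bins=10):
        # Interior quantile edges of x as candidate split points, instead of
        # trying every unique value.
        if len(np.unique(x)) < 2:
            return None  # constant feature in this node
        edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
        candidates = [t for t in np.unique(edges) if (x > t).any()]
        if not candidates:
            return None
        return max(candidates, key=lambda t: info_gain(y, x <= t))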