
Is it possible to generate a decision forest whose trees are all exactly the same? Please note that this is an experimental question. As far as I understand, random forests have two parameters that introduce 'randomness' compared to a single decision tree:

1) number of features randomly sampled at each node of a decision tree, and

2) number of training examples drawn to create a tree.

Intuitively, if I set these two parameters to their maximum values, I should remove the 'randomness', so every tree created should be exactly the same. Because all the trees would be identical, I should get the same results regardless of the number of trees in the forest or of the run (i.e. of the seed value).

I tested this idea using the randomForest package in R. I believe the two parameters above correspond to 'mtry' and 'sampsize', respectively. I set both to their maximum values, but some randomness remains: the OOB error estimates still vary with the number of trees in the forest.

Could you please help me understand how to remove all the randomness in a random decision forest, preferably using the arguments of the randomForest package in R?

Comment (banbar): I set 'sampsize' to its maximum value and checked the split points of the root nodes of different trees (with the function 'getTree'). The split points for the same variable differ between trees. Intuitively, since I used all the available data, the split points for the same variable should be identical, shouldn't they? Please help me understand where this randomness comes from.

1 Answer


In addition to mtry and sampsize, there is another relevant argument in randomForest(): replace. By default, the data points used to grow each tree are sampled with replacement, so even with sampsize equal to the number of data points, each tree sees a different sample containing duplicates and omissions. If you want every data point to be used exactly once in every tree, you need not only to set sampsize to the number of data points but also to set replace=FALSE.
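To illustrate why replace matters, here is a minimal base-R sketch. It does not call randomForest itself; it only mimics drawing n indices out of n with and without replacement, and the value of n and the seed are arbitrary choices made for this illustration.

```r
set.seed(1)   # arbitrary seed, used only to make this sketch reproducible
n <- 10000    # hypothetical number of training points

# Draw n indices out of n, once with and once without replacement.
with_repl    <- sample(n, n, replace = TRUE)
without_repl <- sample(n, n, replace = FALSE)

# Fraction of distinct data points that actually end up in each sample.
frac_with    <- length(unique(with_repl)) / n
frac_without <- length(unique(without_repl)) / n

print(frac_with)     # roughly 1 - exp(-1), i.e. about 0.63
print(frac_without)  # exactly 1: every point appears once
```

With replacement, only about 63% of the points appear in a given sample, and which ones appear changes from draw to draw, so trees grown on such samples can differ even when sampsize equals the number of data points.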

Here's a toy example to show that you can get a forest of identical trees:

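library(randomForest)

set.seed(17)
# Toy data: 10 observations, 5 integer-valued predictors, binary class label
x <- matrix(sample(5, 50, replace=TRUE), 10, 5)
y <- factor(sample(2, 10, replace=TRUE))

# Consider all variables at every split and use every data point exactly once per tree
rf1 <- randomForest(x, y, mtry=ncol(x), sampsize=nrow(x), replace=FALSE, ntree=5)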

You can then use getTree(rf1, 1), etc. to check that all trees are identical.
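Rather than inspecting the trees by eye, the comparison can be automated. The following is a sketch that rebuilds the same toy forest and compares every tree matrix returned by getTree() with the first one; the helper names trees and all_same are introduced here for illustration only.

```r
library(randomForest)

set.seed(17)
x <- matrix(sample(5, 50, replace = TRUE), 10, 5)
y <- factor(sample(2, 10, replace = TRUE))
rf1 <- randomForest(x, y, mtry = ncol(x), sampsize = nrow(x),
                    replace = FALSE, ntree = 5)

# Extract each tree as a matrix (one row per node).
trees <- lapply(seq_len(rf1$ntree), function(k) getTree(rf1, k))

# TRUE only if every tree matrix is identical to the first one.
all_same <- all(vapply(trees, function(tr) identical(tr, trees[[1]]),
                       logical(1)))
print(all_same)
```

If this prints FALSE for your own data, some source of randomness remains, and comparing the differing rows of the tree matrices shows at which node the trees diverge.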