5 votes

When I run a random forest model on my test data, I get different results for the same data set and model.

Here are the results, where you can see the difference in the first column:

> table((predict(rfModelsL[[1]],newdata = a)) ,a$earlyR)

        FALSE TRUE
 FALSE    14    7
 TRUE     13   66

> table((predict(rfModelsL[[1]],newdata = a)) ,a$earlyR)

        FALSE TRUE
 FALSE    15    7
 TRUE     12   66

Although the difference is very small, I'm trying to understand what causes it. I'm guessing that predict uses a "flexible" classification threshold, although I couldn't find anything about that in the documentation. Am I right?

Thank you in advance.

Please read the documentation of the randomForest package a bit more closely. It explains why this is documented behaviour. Your randomForest is a collection of trees, and each time you run the model you'll end up with a slightly different set of trees. That has nothing to do with the predict function; that is simply how random forests work. Besides that, questions about statistical techniques belong on stats.stackexchange.com, not on Stack Overflow. – Joris Meys
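
To illustrate the comment's point about refitting, here is a minimal sketch, not taken from the original thread and using the built-in iris data in place of the question's rfModelsL and a objects: fixing the seed makes a refitted forest reproducible, while refitting without a seed generally produces slightly different trees.

library(randomForest)

# Illustrative only: iris stands in for the question's own data.
set.seed(42)
rf1 <- randomForest(Species ~ ., data = iris, ntree = 501)

set.seed(42)
rf2 <- randomForest(Species ~ ., data = iris, ntree = 501)

# With the same seed the two fits grow identical trees, so their
# predictions agree; without set.seed() each refit draws different
# bootstrap samples and candidate variables, and predictions can differ.
identical(predict(rf1, iris), predict(rf2, iris))   # TRUE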

1 Answer

7 votes

I will assume that you did not refit the model here, and that it is simply the repeated predict call that is producing these results. The answer is probably this, from ?predict.randomForest:

Any ties are broken at random, so if this is undesirable, avoid it by using odd number ntree in randomForest()
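
To make this concrete, here is a minimal sketch (not from the original answer) using the built-in iris data reduced to two classes rather than the question's rfModelsL and a objects. With an even number of trees the class votes can split 50/50, and predict() resolves such ties at random, so repeated calls on the same fitted model may disagree; with an odd number of trees a two-class vote cannot tie.

library(randomForest)

# Two-class stand-in for the question's data.
iris2 <- droplevels(iris[iris$Species != "setosa", ])

# Even ntree: 5-5 vote splits are possible and are broken at random,
# so repeated predict() calls on the same model may differ.
set.seed(1)
rf_even <- randomForest(Species ~ ., data = iris2, ntree = 10)
p1 <- predict(rf_even, newdata = iris2)
p2 <- predict(rf_even, newdata = iris2)
sum(p1 != p2)   # may be > 0 when tied votes occur

# Odd ntree: a two-class vote cannot tie, so predictions are deterministic.
set.seed(1)
rf_odd <- randomForest(Species ~ ., data = iris2, ntree = 11)
q1 <- predict(rf_odd, newdata = iris2)
q2 <- predict(rf_odd, newdata = iris2)
sum(q1 != q2)   # 0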