1
votes

I recently built a random forest model using the ranger package in R. However, I noticed that the predictions stored in the ranger object during training (accessible with model$predictions) do not match the prediction I get if I run the predict command on the same dataset using the model created. The following code reproduces the problem on the mtcars dataset. I created a binary variable just for the sake of converting this to a classification problem though I saw similar results with regression trees as well.

library(datasets)
library(ranger)
mtcars <- mtcars
mtcars$mpg2 <- ifelse(mtcars$mpg > 19.2 , 1, 0)
mtcars <- mtcars[,-1]
mtcars$mpg2 <- as.factor(mtcars$mpg2)
set.seed(123)
mod <- ranger(mpg2 ~ ., mtcars, num.trees = 20, probability = T)
mod$predictions[1,] # Probability of 1 = 0.905
predict(mod, mtcars[1,])$predictions # Probability of 1 = 0.967

This problem also carries on to the randomForest package where I observed a similar problem reproducible with the following code.

library(randomForest)
set.seed(123)
mod <- randomForest(mpg2 ~ ., mtcars, ntree = 20)
mod$votes[1,]
predict(mod, mtcars[1,], type = "prob")

Can someone please tell me why this is happening? I would expect the results to be the same. Am I doing something wrong or is there an error in my understanding of some inherent property of random forest that leads to this scenario?

2

2 Answers

3
votes

I think you may want to look a little more deeply into how a random forest works. I really recommend Introduction to Statistical Learning in R (ISLR), which is available for free online here.

That said, I believe the main issue here is that you are treating the mod$votes value and the predict() value as the same, when they are not quite the same thing. If you look at the documentation of the randomForest function, the mod$votes or mod$predicted values are out-of-bag ("OOB") predictions for the input data. This is different from the value that the predict() function produces, which evaluates an observation on the model produced by randomForest(). Typically, you would want to train the model on one set of data, and use the predict() function on the test set.

Finally, you may need to re-run your set.seed() function every time your make the random forest if you want to achieve the same results for the mod object. I think there is a way to set the seed for an entire session, but I am not sure. This looks like a useful post: Fixing set.seed for an entire session

Side note: Here, you are not specifying the number of variables to use for each tree, but the default is good enough in most cases (check the documentation for each of the random forest functions you are using for the default). Maybe you are doing that in your actual code and didn't include it in your example, but I thought it was worth mentioning.

Hope this helps!

Edit: I tried training the random forest using all of the data except for the first observation (Mazda RX4) and then used the predict function on just that observation, which I think illustrates my point a bit better. Try running something like this:

library(randomForest)
set.seed(123)
mod <- randomForest(mpg2 ~ ., mtcars[-1,], ntree = 200)
predict(mod, mtcars[1,], type = "prob")
0
votes

As you have converted mpg to mpg2, was expecting that you want to build classification model. But nevertheless mod$predictions gives you probability while your model is trying to learn from your data points and predict(mod,mtcars[,1:10])$predictions option gives probability from trained model. Have run same code with Probability = F, and got below result, you can see prediction from trained model is prefect whereas from mod$predictions option we have 3 miss classifications.

mod <- ranger(mpg2 ~ ., mtcars, num.trees = 20, probability = F) 

> table(mtcars$mpg2,predict(mod, mtcars[,1:10])$predictions)

     0  1
  0 17  0
  1  0 15
> table(mtcars$mpg2,mod$predictions)

     0  1
  0 15  2
  1  1 14