
I am new to Random Forests and I have a question about regression. I am using the R package randomForest to fit RF models.

My final goal is to select sets of variables that are important for predicting a continuous trait. So I fit a model, remove the variable with the lowest mean decrease in accuracy, fit a new model, and so on. This worked with RF classification, where I compared the models using the OOB prediction errors (training set) and the errors on the development and validation data sets. Now, with regression, I want to compare the models based on %variation explained and MSE.
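To make the procedure concrete, my elimination loop looks roughly like this (just a sketch; the function name backward_rf and the stopping rule min_vars are illustrative, not fixed):

library(randomForest)

backward_rf <- function(X, Y, min_vars = 2, ntree = 500) {
  vars <- colnames(X)
  models <- list()
  while (length(vars) >= min_vars) {
    rf <- randomForest(x = X[, vars, drop = FALSE], y = Y,
                       ntree = ntree, importance = TRUE)
    models[[length(models) + 1]] <- rf
    imp <- importance(rf, type = 1)  # type 1 = mean decrease in accuracy (%IncMSE)
    vars <- setdiff(vars, rownames(imp)[which.min(imp[, 1])])  # drop the least important variable
  }
  models
}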

I evaluated the MSE and %var explained, and I get exactly the same results when calculating them manually using the predictions from model$predicted. But when I do model$mse, the value presented corresponds to the MSE of the last tree calculated, and the same happens for %var explained.

As an example you can try this code in R:

library(randomForest)
data("iris")
head(iris)

TrainingX <- iris[1:100, 2:4]  # training set - X matrix
TrainingY <- iris[1:100, 1]    # training set - Y vector

TestingX <- iris[101:150, 2:4]  # test set - X matrix
TestingY <- iris[101:150, 1]    # test set - Y vector

set.seed(2)

model <- randomForest(x = TrainingX, y = TrainingY, ntree = 500,  # fitting the model
                      xtest = TestingX, ytest = TestingY)

#for prediction (training set)

pred <- model$predicted

meanY <- sum(TrainingY) / length(TrainingY)

varpY <- sum((TrainingY - meanY)^2) / length(TrainingY)

mseY <- sum((TrainingY - pred)^2) / length(TrainingY)

r2 <- (1 - (mseY / varpY)) * 100

#for testing (test set)

pred_2 <- model$test$predicted

meanY_2 <- sum(TestingY) / length(TestingY)

varpY_2 <- sum((TestingY - meanY_2)^2) / length(TestingY)

mseY_2 <- sum((TestingY - pred_2)^2) / length(TestingY)

r2_2 <- (1 - (mseY_2 / varpY_2)) * 100

training_set_mse <- c(model$mse[500], mseY)
training_set_rsq <- c(model$rsq[500] * 100, r2)
testing_set_mse  <- c(model$test$mse[500], mseY_2)
testing_set_rsq  <- c(model$test$rsq[500] * 100, r2_2)

res <- cbind(training_set_mse, training_set_rsq, testing_set_mse, testing_set_rsq)
rownames(res) <- c("last tree", "by hand")
res
model

After running this code you will obtain a table containing values for MSE and %var explained (also called rsq). The first line, called "last tree", contains the values of MSE and %var explained for the 500th tree in the forest. The second line, called "by hand", contains the results calculated in R from the vectors model$predicted and model$test$predicted.

So, my questions are:

1- Are the predictions of the trees somehow cumulative, or are they independent of each other? (I thought they were independent.)

2- Is the last tree to be considered an average of all the others?

3- Why are the MSE and %var explained of the RF model (shown in the summary printed when you call model) the same as the ones for the 500th tree (see the first line of the table)? Do the vectors model$mse and model$rsq contain cumulative values?

After my last edit I found this post from Andy Liaw (one of the package authors) saying that MSE and %var explained are in fact cumulative: https://stat.ethz.ch/pipermail/r-help/2004-April/049943.html.
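A quick check on the example above confirms it: the last element of model$mse matches the by-hand OOB MSE exactly:

tail(model$mse, 1)  # MSE of the complete 500-tree forest
# [1] 0.1232342
mseY                # by-hand OOB MSE
# [1] 0.1232342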

Typical question for the SO sister site stats here. – ZF007

For the next time, please spend a minute to learn how to properly format your code; be sure also to always include the relevant library imports (both done for you this time). Kudos for the reproducible example, though... – desertnaut

1 Answer


Not sure I understand what your issue is; I'll give it a try nevertheless...

1- Are the predictions of the trees somehow cumulative, or are they independent of each other? (I thought they were independent.)

You thought correctly; the trees are fit independently of each other, hence their predictions are indeed independent. In fact, this is a crucial advantage of RF models, since it allows for parallel implementations.
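As a quick illustration (a sketch reusing the question's TrainingX/TrainingY), two half-forests grown independently can simply be merged with randomForest's combine, which is exactly the property parallel implementations exploit:

library(randomForest)

set.seed(2)
rf1 <- randomForest(x = TrainingX, y = TrainingY, ntree = 250)
rf2 <- randomForest(x = TrainingX, y = TrainingY, ntree = 250)

rf_all <- combine(rf1, rf2)  # merge the two independently grown half-forests
rf_all$ntree
# [1] 500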

2- Is the last tree to be considered an average of all the others?

No; as clarified above, all trees are independent.
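What is true is that, for regression, the forest's prediction is the plain average of all the individual tree predictions; you can verify this directly with predict.all = TRUE (again a sketch reusing the question's objects):

rf <- randomForest(x = TrainingX, y = TrainingY, ntree = 500)
preds <- predict(rf, TestingX, predict.all = TRUE)
# preds$individual is a 50 x 500 matrix: one column of predictions per tree
all.equal(preds$aggregate, rowMeans(preds$individual))
# [1] TRUE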

3- If each tree gets a prediction, how can I get the matrix with all the trees, since what I need is the MSE and % var explained for the forest?

Here is where what you ask starts being really unclear, given your code above; the MSE and r2 you say you need are exactly what you are already computing in mseY and r2:

mseY
[1] 0.1232342

r2
[1] 81.90718

which, unsurprisingly, are the very same values reported by model:

model
# result:

Call:
 randomForest(x = TrainingX, y = TrainingY, ntree = 500) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 1

          Mean of squared residuals: 0.1232342
                    % Var explained: 81.91

so I'm not sure I can really see your issue, or what these values have to do with the "matrix with all the trees"...

But when I do model$mse, the value presented corresponds to the MSE of the last tree calculated, and the same happens for %var explained.

Most certainly not: model$mse is a vector of length equal to the number of trees (here 500), containing the MSE for each individual tree (but see UPDATE below); I have never seen any use for this in practice (similarly for model$rsq):

length(model$mse)
[1] 500

length(model$rsq)
[1] 500

UPDATE: Kudos to the OP herself (see comments), who discovered that the quantities in model$mse and model$rsq are indeed cumulative (!); from an old (2004) thread by package maintainer Andy Liaw, Extracting the MSE and % Variance from RandomForest:

Several ways:

  1. Read ?randomForest, especially the `Value' section.
  2. Look at str(myforest.rf).
  3. Look at print.randomForest.

If the forest has 100 trees, then the mse and rsq are vectors with 100 elements each, the i-th element being the mse (or rsq) of the forest consisting of the first i trees. So the last element is the mse (or rsq) of the whole forest.
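This cumulative definition is easy to verify on the test set of the question's example (a sketch; note keep.forest = TRUE is needed here, because supplying xtest otherwise discards the forest and a later predict call would fail):

library(randomForest)
set.seed(2)
rf <- randomForest(x = TrainingX, y = TrainingY, ntree = 500,
                   xtest = TestingX, ytest = TestingY, keep.forest = TRUE)

ind <- predict(rf, TestingX, predict.all = TRUE)$individual  # 50 x 500 matrix
cum_sum <- t(apply(ind, 1, cumsum))                     # running sums per row
cum_pred <- sweep(cum_sum, 2, seq_len(ncol(ind)), "/")  # i-tree forest predictions
mse_i <- colMeans((TestingY - cum_pred)^2)

all.equal(unname(mse_i), rf$test$mse)
# [1] TRUE  -- test$mse[i] is the MSE of the forest built from the first i trees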