I am new to Random Forests and I have a question about regression. I am using R package randomForests to calculate RF models.
My final goal is to select sets of variables important for prediction of a continuous trait, and so I am calculating a model, then I remove the variable with lowest mean decrease in accuracy, and I calculate a new model, and so on. This worked with RF classification, and I compared the models using the OOB errors from prediction (training set), development and validation data sets. Now with regression I want to compare the models based on %variation explained and MSE.
I was evaluating the results for MSE and %var explained, and I get exactly the same results when calculating manually using the prediction from model$predicted. But when I do model$mse, the value presented corresponds to the value of MSE for the last tree calculated, and the same happens for % var explained.
As an example you can try this code in R:
library(randomForest)
data("iris")
head(iris)
TrainingX<-iris[1:100,2:4] #creating training set - X matrix
TrainingY<-iris[1:100,1] #creating training set - Y vector
TestingX<-iris[101:150,2:4] #creating test set - X matrix
TestingY<-iris[101:150,1] #creating test set - Y vector
set.seed(2)
model<-randomForest(x=TrainingX, y= TrainingY, ntree=500, #calculating model
xtest = TestingX, ytest = TestingY)
#for prediction (training set)
pred<-model$predicted
meanY<-sum(TrainingY)/length(TrainingY)
varpY<-sum((TrainingY-meanY)^2)/length(TrainingY)
mseY<-sum((TrainingY-pred)^2)/length(TrainingY)
r2<-(1-(mseY/varpY))*100
#for testing (test set)
pred_2<-model$test$predicted
meanY_2<-sum(TestingY)/length(TestingY)
varpY_2<-sum((TestingY-meanY_2)^2)/length(TestingY)
mseY_2<-sum((TestingY-pred_2)^2)/length(TestingY)
r2_2<-(1-(mseY_2/varpY_2))*100
training_set_mse<-c(model$mse[500], mseY)
training_set_rsq<-c(model$rsq[500]*100, r2)
testing_set_mse<-c(model$test$mse[500],mseY_2)
testing_set_rsq<-c(model$test$rsq[500]*100, r2_2)
c<-cbind(training_set_mse,training_set_rsq,testing_set_mse, testing_set_rsq)
rownames(c)<-c("last tree", "by hand")
c
model
As a result after running this code you will obtain a table containing values for MSE and %var explaines (also called rsq). The first line is called "last tree" and contains the values of MSE and %var explained for the 500th tree in the forest. The second line is called "by hand" and it contains results calculated in R based on the vectors model$predicted and model$test$predicted.
So, my questions are:
1- Are the predictions of the trees somehow cumulative? Or are they independent from each other? (I thought they were independent)
2- Is the last tree to be considered as an average of all the others?
3- Why are MSE and %var explained of the RF model (presented in the main board when you call model) the same as the ones from the 500th tree (see first line of table)? Do the vectors model$mse or model$rsq contain cumulative values?
After the last edit I found this post from Andy Liaw (one of the creators of the package) that says that MSE and %var explained are in fact cumulative!: https://stat.ethz.ch/pipermail/r-help/2004-April/049943.html.