I am rather new to machine learning and am currently trying to implement a random forest classifier using the caret and randomForest packages in R. I am using the trainControl function with repeated cross-validation. As far as I understand, random forest usually uses bagging to split the training data into different subsets with replacement, using 1/3 as a validation set on which the OOB is calculated. But what happens if you specify that you want to use k-fold cross-validation? From the caret documentation, I assumed that it uses only cross-validation for the resampling. But if it only used cross-validation, why do you still get an OOB error? Or is bagging still used for the creation of the model and cross-validation for the performance evaluation?

library(caret)

# 10-fold cross-validation, repeated 3 times
TrainingControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                                savePredictions = TRUE, classProbs = TRUE,
                                search = "grid")

train(x ~ ., data = training_set,
      method = "rf",
      metric = "Accuracy",
      trControl = TrainingControl,
      ntree = 1000,        # passed through to randomForest()
      importance = TRUE)   # passed through to randomForest()

1 Answer

Trying to address your questions:

"random forest usually uses bagging to split the training data into different subsets with replacement, using 1/3 as a validation set on which the OOB is calculated"

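Yes. caret calls randomForest() from the randomForest package. Each tree is grown on a bootstrap sample of the training data (drawn with replacement), and the rows left out of that sample, about one third on average, are that tree's out-of-bag (OOB) rows. The trees are then bagged, i.e. their predictions are aggregated, to reduce overfitting. From Wikipedia: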

This bootstrapping procedure leads to better model performance because it decreases the variance of the model, without increasing the bias. This means that while the predictions of a single tree are highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated.
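
To make the "1/3" figure concrete, here is a minimal base R sketch that simulates the bootstrap step on its own. The row count and number of trees are made-up values for illustration, and the code only mimics the sampling, not the tree fitting done by randomForest().

```r
# Simulate bootstrap sampling to see how many rows end up out-of-bag (OOB).
set.seed(42)           # arbitrary seed, for reproducibility only
n <- 1000              # hypothetical number of training rows
n_trees <- 500         # hypothetical number of trees

oob_fraction <- replicate(n_trees, {
  in_bag <- sample(n, size = n, replace = TRUE)  # bootstrap sample for one tree
  1 - length(unique(in_bag)) / n                 # share of rows never drawn
})

mean(oob_fraction)     # average OOB share across the simulated trees
exp(-1)                # (1 - 1/n)^n tends to exp(-1), about 0.368
```

The simulated average should sit close to exp(-1), which is where the rule of thumb of roughly one third comes from.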

So if you request k-fold cross-validation from caret, it simply runs randomForest() on different training sets. That answers this part of your question:

"But what happens if you specify that you want to use k-fold cross-validation? From the caret documentation, I assumed that it uses only cross-validation for the resampling. But if it only used cross-validation, why do you still get an OOB error?"

The bootstrap sampling and bagging are still performed, because they are part of randomForest(). caret simply repeats this on each training set and estimates the error on the respective held-out test set. The OOB error generated by randomForest() is still reported regardless. The difference is that cross-validation gives you truly "unseen" data that can be used to evaluate your model.
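
As a rough illustration of where the two error estimates live, here is a small sketch. It assumes the built-in iris data as a stand-in for your training_set, and it uses fewer folds, fewer trees and a single fixed mtry than your call, purely to keep the run short. caret refits the final model on the full data set passed to train(), so the OOB error shown is that of this final forest.

```r
library(caret)          # the randomForest package must also be installed

set.seed(1)             # arbitrary seed
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 2)

# iris stands in for the asker's training_set
fit <- train(Species ~ ., data = iris,
             method = "rf",
             trControl = ctrl,
             tuneGrid = data.frame(mtry = 2),  # single candidate keeps the run short
             ntree = 200)

# Cross-validated performance: one row per held-out fold (5 folds x 2 repeats)
fit$resample

# The final model is a randomForest object refit on all rows of iris;
# its OOB error comes from the bootstrap samples inside that single forest
rf <- fit$finalModel
rf$err.rate[rf$ntree, "OOB"]
```

The two numbers will usually be close but not identical, since the cross-validated accuracy is computed on folds that the corresponding forests never saw, while the OOB error is computed inside one forest from the rows each tree happened not to draw.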