I fitted a random forest model using both the randomForest and ranger packages. I didn't tune the number of trees in the forest; I just left it at the default, which is 500. Now I would like to check whether that is enough, i.e. whether the error has reached a plateau. I believe I would need to extract the individual trees, take a random subset of, for example, 100, 200, 300, 400 and finally 500 of them, pick out the OOB trees in each subset, and calculate the OOB error for 100, 200, ... trees in turn. Then I could plot the OOB error against the number of trees. I found that randomForest::getTree and ranger::treeInfo each return a data.frame describing a tree, but I can't figure out what is what in there. What's more, ranger::treeInfo returns a data frame with 50% NAs, and the overall output is even harder to read.
So my questions are:
- How can I plot the OOB error vs number of trees used in a forest if I already have a random forest with 500 trees?
- Why does the output of ranger::treeInfo have 50% NAs, and why are the rows with NAs the only ones that have predictions?
Here's a minimal example:
library(dplyr)    # %>%, mutate, select
library(ggplot2)  # mpg data set

mpg2 <- mpg %>%
  mutate(is_suv = as.factor(class == 'suv')) %>%
  select(-class)
mpg_model <- ranger::ranger(is_suv ~ ., data = mpg2)
ranger::treeInfo(mpg_model, tree = 100)
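
To make the target concrete, here is a brute-force sketch of the kind of plot I am after. It refits a separate forest for each size instead of reusing the existing 500-tree model, which is exactly what I would like to avoid. It assumes that the prediction.error element of a ranger fit is the overall OOB error; the names n_trees and oob_error and the fixed seed are my own choices for illustration.

```r
library(dplyr)    # %>%, mutate, select
library(ggplot2)  # mpg data set

mpg2 <- mpg %>%
  mutate(is_suv = as.factor(class == "suv")) %>%
  select(-class)

# Forest sizes to compare (illustrative choice)
n_trees <- seq(100, 500, by = 100)

# Brute force: fit one forest per size and record its OOB error.
# Assumption: $prediction.error is the OOB misclassification rate.
oob_error <- sapply(n_trees, function(n) {
  fit <- ranger::ranger(is_suv ~ ., data = mpg2, num.trees = n, seed = 1)
  fit$prediction.error
})

# OOB error against the number of trees
plot(n_trees, oob_error, type = "b",
     xlab = "Number of trees", ylab = "OOB error")
```

Because each point comes from an independently grown forest, the curve is noisier than one computed from nested subsets of a single forest would be, so it only approximates the plot described above.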