2 votes

I did cross-validation on my data using the random forest method in the caret package. R says the final model was built using mtry = 34. Does this mean that in the final random forest (the one resulting from cross-validation) only 34 of the variables in my data set were used for splitting in the trees?

> output
Random Forest 

 375 samples
  592 predictors
  2 classes: 'alzheimer', 'control' 

  No pre-processing
  Resampling: Cross-Validated (3 fold) 
  Summary of sample sizes: 250, 250, 250 
  Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
    2   0.6826667  0.3565541
   34   0.7600000  0.5194246
  591   0.7173333  0.4343563

  Accuracy was used to select the optimal model using the largest value.
  The final value used for the model was mtry = 34.
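(For reference, output like the above is what caret::train prints for a call along roughly these lines. This is a sketch only, not the asker's actual code; the data frame my_data and the outcome column diagnosis are placeholder names.)

    library(caret)

    set.seed(1)
    fit <- train(
      diagnosis ~ .,                                          # 2-class outcome, 592 predictors
      data       = my_data,
      method     = "rf",                                      # randomForest under the hood
      trControl  = trainControl(method = "cv", number = 3),   # 3-fold cross-validation
      tuneLength = 3                                          # small grid of candidate mtry values
    )
    print(fit)                                                # summary like the one shown above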

2 Answers

5 votes

Since you've built your random forest using the caret package, a tip: use $finalModel to obtain a summary of your final model, i.e. the model that was selected according to the pre-defined performance metric (by default, accuracy).
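A minimal sketch, assuming fit is the object returned by caret::train (as in the sketch under the question; the object name is a placeholder):

    fit$finalModel           # the randomForest object refit on the full training set
                             #   with the winning tuning value (here, mtry = 34)
    class(fit$finalModel)    # "randomForest", so the usual randomForest helpers apply,
                             #   e.g. importance(fit$finalModel)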

Now to answer your question:

From the image below, you can see that the random forest randomly chooses from 34 variables at each split (my example shows 31, but you get the point). This is not to be confused with using only 34 variables to grow each tree, as your question suggests. In fact, all variables are still available to a sufficiently large random forest; it's just that at each node, the splitting variable is chosen from a random pool of 34 candidates, which reduces the variance of the model. This makes the trees more independent from one another and, consequently, makes the gains from averaging over a large number of trees more significant.

[Screenshot: summary of the answerer's $finalModel, showing the number of variables tried at each split]

The growing process for each tree is as follows (assuming you're using the randomForest implementation, whether through caret or from the randomForest package directly); a rough code sketch follows the list:

  • For a dataset of dimension N x M (N observations, M variables), draw N observations with replacement from the original data (covering roughly two-thirds of the distinct observations) and use this bootstrap sample as the training set; the observations left out (roughly one-third) serve as the out-of-bag test set for that tree
  • A number m (smaller than M) is specified such that at each node split, m variables are selected at random out of the M, and the best candidate among those m (measured by information gain) is used to split the node; m is held constant while the forest is grown
  • Each tree is grown to the largest extent possible without pre- or post-pruning
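To make the role of mtry concrete, here is a rough conceptual sketch in R of the steps above for a single tree. It is not the actual randomForest source code, just an illustration of where the per-node sampling happens; all names are made up.

    # Grow one tree of the forest (conceptual sketch, not real internals).
    grow_one_tree <- function(data, mtry) {
      n <- nrow(data)
      p <- ncol(data) - 1                      # assume the last column is the class label

      # Step 1: bootstrap sample of n rows with replacement; rows never drawn
      # (roughly one-third of them) form the out-of-bag set for this tree.
      in_bag <- sample(n, n, replace = TRUE)
      oob    <- setdiff(seq_len(n), in_bag)

      split_node <- function(rows) {
        # Step 2: at THIS node, draw mtry candidate variables out of the p
        # available ones; only these candidates compete for the split.
        candidates <- sample(p, mtry)
        # ... score each candidate (e.g. by information gain), split on the best
        # one, and recurse on the children until they are pure (Step 3: no pruning) ...
        candidates
      }

      split_node(in_bag)
    }

Every tree therefore has access to all M variables over the course of its growth; the restriction to mtry candidates applies per node, not per tree.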

Sorry for the two-month-late answer, but I thought this was a great question and it would be a shame if it didn't get a more elaborate explanation of what the mtry parameter actually does. It's quite often misunderstood, so I thought I would add an answer here!

1 vote

From the documentation of randomForest:

mtry: Number of variables randomly sampled as candidates at each split.

In this case, the final model considers 34 randomly sampled candidate variables at each split in its trees.
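A minimal sketch of what that corresponds to when calling randomForest directly (my_data and diagnosis are placeholder names, not from the question):

    library(randomForest)

    set.seed(1)
    rf <- randomForest(diagnosis ~ ., data = my_data,
                       mtry  = 34,    # 34 candidate variables sampled at each split
                       ntree = 500)   # default number of trees
    rf                                # printed summary reports the number of variables
                                      #   tried at each split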