Since you've built your random forest using the `caret` package, a tip is to use `$finalModel` to obtain a summary of your final model, i.e. the model that was selected according to a pre-defined performance metric (default: OOB accuracy).
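For readers more familiar with Python, here is a rough scikit-learn analogue of that workflow (this is sklearn's API, not caret's; the dataset and parameter grid below are made up for illustration). `GridSearchCV` tunes `max_features` (sklearn's counterpart to `mtry`), and `best_estimator_` plays the role of `$finalModel`: the single model refit with the winning setting.

```python
# Illustrative sklearn analogue (assumption: not the caret API).
# GridSearchCV tunes max_features (sklearn's counterpart to mtry);
# best_estimator_ holds the refit final model, like $finalModel in caret.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid={"max_features": [2, 3, 5]},  # candidate mtry values
    cv=3,
)
search.fit(X, y)
final_model = search.best_estimator_  # the "final model", refit on all data
print(final_model.max_features)
```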
Now to answer your question:
The model summary reports that the random forest randomly chooses from 34 variables at each split (in my own example it's 31, but you get the point). This is not to be confused with using only 34 variables to grow each tree, as per your question. In fact, in a sufficiently large random forest all variables are used; it's just that at each node, the splitting variable is chosen from a random pool of 34 candidates, which reduces the variance of the model. This makes the trees more independent of one another and, consequently, makes the gains from averaging over a large number of trees more significant.
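A toy simulation makes this concrete (the numbers `M = 100` and `m = 34` are hypothetical, and this is not the `randomForest` implementation): even though only `m` variables are candidates at any single split, over the many splits in a forest essentially every variable gets considered at least once.

```python
# Toy simulation (assumption: illustrative only, not randomForest internals).
# At each node split, m candidate variables are drawn at random from the M
# available; across many splits, virtually all M variables get a chance.
import random

random.seed(0)
M, m = 100, 34            # total variables, candidates per split (mtry)
n_splits = 200            # splits across one modestly sized forest

seen = set()
for _ in range(n_splits):
    candidates = random.sample(range(M), m)  # draw m of M without replacement
    seen.update(candidates)                  # variables ever seen as candidates

print(f"{len(seen)} of {M} variables were considered at least once")
```

With these numbers, the chance that any given variable is never drawn is (1 − 34/100)²⁰⁰, which is vanishingly small, so all 100 variables end up participating.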
The tree-growing process for each tree is as follows (bold for emphasis, and assuming you're using the `randomForest` implementation, either through `caret` or from the `randomForest` package directly):
- For a dataset of dimension N x M (N observations, M variables), sample N observations **with replacement** from the original data. This bootstrap sample, which contains roughly two-thirds of the distinct original observations, is used as the training set; the observations left out (roughly one-third) form the out-of-bag (OOB) set and are used as a test set
- A number `m` (smaller than `M`) is specified such that at each node split, `m` variables are selected **at random** out of the `M`, and the best candidate among those `m` (measured by the resulting decrease in node impurity, e.g. information gain) is used to split the node. `m` is held **constant** while the forest is grown
- Each tree is grown to the largest extent possible, **without pre- or post-pruning**
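The first step above can be sketched in a few lines (a simulation under stated assumptions, not the `randomForest` internals): sampling N rows with replacement leaves roughly one-third of the original rows out-of-bag, since the expected in-bag fraction is 1 − 1/e ≈ 0.632.

```python
# Sketch of the bootstrap + OOB step (assumption: toy simulation only).
# Sampling N indices with replacement leaves about one-third of the rows
# out-of-bag; the expected unique in-bag fraction is 1 - 1/e ~ 0.632.
import random

random.seed(1)
N = 10_000
in_bag = set(random.choices(range(N), k=N))   # N draws with replacement
oob = set(range(N)) - in_bag                  # rows never drawn: the OOB set

print(f"in-bag unique fraction: {len(in_bag) / N:.3f}, "
      f"OOB fraction: {len(oob) / N:.3f}")
```

This is where the "~two-thirds / ~one-third" figures in the step above come from, and why the OOB observations provide a built-in test set for each tree.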
Sorry for the two-month-late answer, but I thought this was a great question and it would be a shame if it didn't get a more elaborate explanation of what the `mtry` parameter truly does. It's quite often misunderstood, so I thought I'd add an answer here!