1 vote

I am currently trying to optimize a random forest classifier for a very high-dimensional dataset (p > 200k) using recursive feature elimination (RFE). The caret package has a nice implementation for this (the rfe() function). However, I am also thinking about RAM and CPU usage, so I wonder whether it is possible to set a different (larger) number of trees for the first forest (trained before any feature elimination) and then use its importances to build the remaining forests (during RFE) with, for example, 500 trees and 5- or 10-fold cross-validation. I know this option is available in varSelRF, but how about caret? I didn't manage to find anything about this in the manual.

I am not sure I get what you are after. Do you want to 1) train a random forest on all data, and 2) use its importance estimates to reduce the number of features prior to 3) doing RFE? - Backlin
Dear @Backlin. Correct me if I am wrong, but I thought that when performing RFE, you actually train a new forest at each step. Therefore, if you have N steps (gradually removing, for example, 10%, 20%, ... of the features), you need to train N forests. It becomes even more expensive if you also cross-validate each step (N * n folds). I wonder whether it is possible to perform RFE in caret with a reduced number of trees (500), based on importances extracted from the "bigger forest" (10k trees). Does that make sense? - sharky
Oh sorry, your question makes complete sense, I was just confused. - Backlin

1 Answer

4 votes

You can do that. The rfFuncs list has an element called fit that defines how the model is fit. One argument to this function, first, is TRUE on the first fit (there is also a last argument). You can set ntree based on this.
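
For example, here is a minimal sketch of one way to do it, starting from caret's built-in rfFuncs and swapping in a custom fit function. The tree counts (10k for the first fit, 500 afterwards) are the ones from the question; predictors, outcome, and the sizes vector are placeholders you would replace with your own data.

    library(caret)
    library(randomForest)

    # Start from caret's built-in random forest helper functions
    customFuncs <- rfFuncs

    # Custom fit: grow a large forest on the first fit (full feature set,
    # whose importances drive the elimination) and smaller forests for the
    # subsequent RFE fits. importance = TRUE keeps importances available
    # at every step.
    customFuncs$fit <- function(x, y, first, last, ...) {
      randomForest(x, y,
                   importance = TRUE,
                   ntree = if (first) 10000 else 500,
                   ...)
    }

    ctrl <- rfeControl(functions = customFuncs,
                       method = "cv",
                       number = 5)

    # rfProfile <- rfe(predictors, outcome,
    #                  sizes = c(1000, 5000, 10000, 50000),
    #                  rfeControl = ctrl)

Keep in mind that rfe runs the whole elimination sequence inside each resample, so the large first-fit forest is grown once per fold (plus once on the full data), not just once overall.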

See the feature selection vignette for more details.

Max