
In caret, can you derive the predictors used to train a model when the algorithm optimizes from among many?

I've delegated preprocessing to caret for an assignment, since I know I won't be able to tease apart the data myself. As I understand it, a random forest considers a random subset of the predictors at each split of each decision tree.

Given that mtry is

Number of variables available for splitting at each tree node.

and a summary of

Resampling results across tuning parameters:

  mtry  Accuracy   Kappa      Accuracy SD   Kappa SD   
   2    0.9944614  0.9929903  0.0010947590  0.001386114
  28    0.9979948  0.9974629  0.0009365892  0.001183031
  55    0.9957888  0.9946703  0.0019214403  0.002432008

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 28. 
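
That summary can also be pulled off the fitted object directly; a minimal sketch, using the model object from the training call below (results and bestTune are standard components of a caret train object):

# resampled performance for each mtry value that was tried
model$results

# the winning tuning parameter, here mtry = 28
model$bestTune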

I'd like to know which features were culled and which were useful (particularly the two that yielded 99.4% accuracy).

model <- train(classe ~ ., method = "rf", data = trainPre,
               prox = TRUE, allowParallel = TRUE)

> summary(result$model)
                Length    Class      Mode     
call                    5 -none-     call     
type                    1 -none-     character
predicted           15699 factor     numeric  
err.rate             3000 -none-     numeric  
confusion              30 -none-     numeric  
votes               78495 matrix     numeric  
oob.times           15699 -none-     numeric  
classes                 5 -none-     character
importance             58 -none-     numeric  
importanceSD            0 -none-     NULL     
localImportance         0 -none-     NULL     
proximity       246458601 -none-     numeric  
ntree                   1 -none-     numeric  
mtry                    1 -none-     numeric  
forest                 14 -none-     list     
y                   15699 factor     numeric  
test                    0 -none-     NULL     
inbag                   0 -none-     NULL     
xNames                 58 -none-     character
problemType             1 -none-     character
tuneValue               1 data.frame list     
obsLevels               5 -none-     character

Are these predictors squirreled away somewhere in the model object?

See the importance function from the randomForest package. Also varImpPlot. – eipi10
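
A minimal sketch of that suggestion, applied to the randomForest fit that caret stores in the finalModel slot (the model object name is taken from the question):

library(randomForest)

# per-predictor importance scores from the underlying forest
importance(model$finalModel)

# dot plot of the same scores, most important at the top
varImpPlot(model$finalModel)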

1 Answer


There is a function called predictors for just this purpose.

However, a few caveats:

  • note that this tells you which predictors are functionally part of the prediction equation. In most cases, when making predictions on a new data set, you still need the others in place due to assumptions made by the package authors (often, the predictors are stored by column index rather than by name, so everything is needed).
  • there is currently a bug in randomForest that prevents this from working with the formula method. I submitted a bug request to Andy in February about it so I'll send him a reminder.
  • random forests will include a lot of irrelevant predictors in this list. If mtry randomly exposes the splitting routine to non-informative predictors, they will show up in the list.

An example:

> library(caret)
> 
> set.seed(135)
> tr_dat <- twoClassSim(100)
> 
> set.seed(417)
> mod <- train(x = tr_dat[, -ncol(tr_dat)], y = tr_dat$Class, method = "rf")
> 
> predictors(mod)
[1] "TwoFactor1" "TwoFactor2" "Linear01"   "Linear02"   "Linear03"   "Linear04"  
[7] "Linear05"   "Linear06"   "Linear07"   "Linear08"   "Linear09"   "Linear10"  
[13] "Nonlinear1" "Nonlinear2" "Nonlinear3"

Max