
I'm using caret to find & compare predictions for multiple models. I'm first partitioning my data into 5 cross-validation folds, then using 10-fold CV within each of the 5 training datasets to select optimal model parameters.

Example code on a small (n=400) test dataset for a single glmnet model:

# Load caret and the data; factor the admit variable.
> library(caret)
> mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
> mydata$admit <- as.factor(mydata$admit)

# Relabel the levels as yes/no so the class probability columns get valid names.
> levels(mydata$admit) <- c("yes", "no")

# Partition data into 5 folds.
> set.seed(123)
> folds <- createFolds(mydata$admit, k=5)

# Train elastic net logistic regression via 10-fold CV on each of 5 training folds using index argument.
> set.seed(123)
> train_control <- trainControl( method="cv",
 number=10,
 index=folds,
 classProbs = TRUE,
 savePredictions = TRUE)

> glmnetGrid <- expand.grid(alpha=c(0, .5, 1), lambda=c(.1, 1, 10))
> model <- train(admit ~ .,
 data=mydata,
 trControl=train_control,
 method="glmnet",
 family="binomial",
 tuneGrid=glmnetGrid,
 metric="Accuracy",
 preProcess=c("center","scale"))

> model
glmnet 

400 samples
  3 predictor
  2 classes: 'yes', 'no' 

Pre-processing: centered (3), scaled (3) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 79, 80, 80, 81, 80 
Resampling results across tuning parameters:

  alpha  lambda  Accuracy      Kappa          Accuracy SD     Kappa SD     
  0.0     0.1    0.6918972780  0.08970669720  0.016425551472  0.08416581606
  0.0     1.0    0.6825007141  0.00000000000  0.001368477994  0.00000000000
  0.0    10.0    0.6825007141  0.00000000000  0.001368477994  0.00000000000
  0.5     0.1    0.6818893800  0.04127002380  0.008252409699  0.04052581228
  0.5     1.0    0.6825007141  0.00000000000  0.001368477994  0.00000000000
  0.5    10.0    0.6825007141  0.00000000000  0.001368477994  0.00000000000
  1.0     0.1    0.6800085023  0.02149826881  0.005876570847  0.04807159045
  1.0     1.0    0.6825007141  0.00000000000  0.001368477994  0.00000000000
  1.0    10.0    0.6825007141  0.00000000000  0.001368477994  0.00000000000

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were alpha = 0 and lambda = 0.1. 
> summary(model$pred)
  pred        obs          rowIndex           yes                  no                 alpha         lambda       Resample        
 yes:14192   yes:9828   Min.   :  1.00   Min.   :0.2650250   Min.   :0.03333769   Min.   :0.0   Min.   : 0.1   Length:14400      
 no :  208   no :4572   1st Qu.:100.75   1st Qu.:0.6750000   1st Qu.:0.31250000   1st Qu.:0.0   1st Qu.: 0.1   Class :character  
                        Median :200.50   Median :0.6835443   Median :0.31645570   Median :0.5   Median : 1.0   Mode  :character  
                        Mean   :200.50   Mean   :0.6840322   Mean   :0.31596777   Mean   :0.5   Mean   : 3.7                     
                        3rd Qu.:300.25   3rd Qu.:0.6875000   3rd Qu.:0.32500000   3rd Qu.:1.0   3rd Qu.:10.0                     
                        Max.   :400.00   Max.   :0.9666623   Max.   :0.73497501   Max.   :1.0   Max.   :10.0                     

Question: Does caret syntax allow me to obtain the 5 test fold predictions for the corresponding best-fitting models for each of the 5 training fold partitions?

As it is, model$pred returns 14,400 predictions and the best-fitting model for the entire dataset. I'd like model$pred to return the n = 5 x 80 = 400 predictions from the 5 separate models fitted to each training fold.
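(As a rough sanity check on how those 14,400 rows are organized, and assuming the model object above, they can be tabulated by resample and tuning combination, and subset post hoc to the selected parameters via model$bestTune; this is a workaround sketch, not necessarily the cleanest caret idiom:)

# Rows of model$pred per resample and per alpha/lambda combination.
with(model$pred, table(Resample, paste(alpha, lambda)))

# Post-hoc workaround: keep only the rows at the selected tuning parameters.
best_pred <- merge(model$pred, model$bestTune)
nrow(best_pred)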

Great question. Just to avoid confusion, can you replace "test fold" with "hold-out fold", as it is used in the caret::trainControl documentation? savePredictions: "an indicator of how much of the hold-out predictions for each resample should be saved." The terms "test" and "testing set" should be reserved for the split of the data into a training set and a testing set. – Agile Bean

1 Answer


You just need to set savePredictions = "final" in trainControl(). That limits model$pred to the hold-out predictions made with the final (optimal) tuning parameters, which should be the output you need.
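For example, a minimal sketch reusing the objects defined in the question (mydata, folds, glmnetGrid); only the savePredictions value changes:

# Same resampling setup as before, but keep only the hold-out predictions
# made with the final (selected) tuning parameters.
train_control <- trainControl(method = "cv",
                              number = 10,
                              index = folds,
                              classProbs = TRUE,
                              savePredictions = "final")

model <- train(admit ~ .,
               data = mydata,
               trControl = train_control,
               method = "glmnet",
               family = "binomial",
               tuneGrid = glmnetGrid,
               metric = "Accuracy",
               preProcess = c("center", "scale"))

# model$pred now holds one row per hold-out observation per resample,
# all at the selected alpha/lambda; split by fold to get them separately.
head(model$pred)
pred_by_fold <- split(model$pred, model$pred$Resample)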