I'm using caret to generate and compare predictions from multiple models. I first partition my data into 5 cross-validation folds, then use 10-fold CV within each of the 5 training sets to select the optimal model parameters.
Example code for a single glmnet model on a small (n = 400) test dataset:
# Load caret, load the data, and factor the admit variable.
> library(caret)
> mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
> mydata$admit <- as.factor(mydata$admit)
# Create levels yes/no to make sure the class probabilities get a correct name.
> levels(mydata$admit) <- c("yes", "no")
# Partition data into 5 folds.
> set.seed(123)
> folds <- createFolds(mydata$admit, k=5)
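# Note (my assumption about the intended design): createFolds() returns the
# *held-out* indices by default (returnTrain=FALSE), so passing these folds as
# trainControl's index trains each model on ~80 rows and predicts the other
# ~320. Training on 4/5 of the data would instead need something like:
# > folds <- createFolds(mydata$admit, k=5, returnTrain=TRUE)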
# Train elastic net logistic regression via 10-fold CV on each of the 5 training folds using the index argument.
> set.seed(123)
> train_control <- trainControl(method = "cv",
                                number = 10,
                                index = folds,
                                classProbs = TRUE,
                                savePredictions = TRUE)
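# Note: when index is supplied, caret resamples over exactly those 5 index
# sets, so (as I understand it) number=10 has no effect here. Also,
# savePredictions=TRUE saves hold-out predictions for every tuning
# combination; savePredictions="final" would save only the winning one.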
> glmnetGrid <- expand.grid(alpha=c(0, .5, 1), lambda=c(.1, 1, 10))
> model <- train(admit ~ .,
                 data = mydata,
                 trControl = train_control,
                 method = "glmnet",
                 family = "binomial",
                 tuneGrid = glmnetGrid,
                 metric = "Accuracy",
                 preProcess = c("center", "scale"))
> model
glmnet
400 samples
3 predictor
2 classes: 'yes', 'no'
Pre-processing: centered (3), scaled (3)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 79, 80, 80, 81, 80
Resampling results across tuning parameters:
  alpha  lambda  Accuracy      Kappa          Accuracy SD     Kappa SD
  0.0     0.1    0.6918972780  0.08970669720  0.016425551472  0.08416581606
  0.0     1.0    0.6825007141  0.00000000000  0.001368477994  0.00000000000
  0.0    10.0    0.6825007141  0.00000000000  0.001368477994  0.00000000000
  0.5     0.1    0.6818893800  0.04127002380  0.008252409699  0.04052581228
  0.5     1.0    0.6825007141  0.00000000000  0.001368477994  0.00000000000
  0.5    10.0    0.6825007141  0.00000000000  0.001368477994  0.00000000000
  1.0     0.1    0.6800085023  0.02149826881  0.005876570847  0.04807159045
  1.0     1.0    0.6825007141  0.00000000000  0.001368477994  0.00000000000
  1.0    10.0    0.6825007141  0.00000000000  0.001368477994  0.00000000000
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were alpha = 0 and lambda = 0.1.
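# Note: the selected parameters are also stored in model$bestTune
# (here alpha = 0 and lambda = 0.1).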
> summary(model$pred)
  pred         obs          rowIndex           yes                 no                 alpha         lambda        Resample
 yes:14192   yes:9828   Min.   :  1.00   Min.   :0.2650250   Min.   :0.03333769   Min.   :0.0   Min.   : 0.1   Length:14400
 no :  208   no :4572   1st Qu.:100.75   1st Qu.:0.6750000   1st Qu.:0.31250000   1st Qu.:0.0   1st Qu.: 0.1   Class :character
                        Median :200.50   Median :0.6835443   Median :0.31645570   Median :0.5   Median : 1.0   Mode  :character
                        Mean   :200.50   Mean   :0.6840322   Mean   :0.31596777   Mean   :0.5   Mean   : 3.7
                        3rd Qu.:300.25   3rd Qu.:0.6875000   3rd Qu.:0.32500000   3rd Qu.:1.0   3rd Qu.:10.0
                        Max.   :400.00   Max.   :0.9666623   Max.   :0.73497501   Max.   :1.0   Max.   :10.0
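As far as I can tell, those 14,400 rows are the hold-out predictions for every tuning combination: 9 parameter combinations × 5 resamples × ~320 held-out rows per resample = 14,400. A quick check of that breakdown (a sketch using the columns shown above):

> nrow(model$pred)                            # 14400 rows in total
> table(model$pred$alpha, model$pred$lambda)  # 1600 rows per tuning combination

Here 1,600 = 5 resamples × ~320 hold-out rows each.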
Question: does caret's syntax allow me to obtain the 5 test fold predictions from the corresponding best-fitting model for each of the 5 training fold partitions?
As it is, model$pred returns 14,400 predictions, and the final model is the single best fit refit to the entire dataset. I'd like model$pred to return n = 5 × 80 = 400 predictions, coming from the 5 separate models fitted to each training fold.
From the caret::trainControl documentation:

    savePredictions: an indicator of how much of the hold-out predictions for each resample should be saved.
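Based on that, the closest workaround I can see (a sketch; it relies on model$bestTune holding the selected alpha and lambda, with merge() matching on those shared columns) is to subset model$pred to the winning tuning combination:

> # Keep only hold-out predictions made with the selected parameters.
> best_preds <- merge(model$pred, model$bestTune)
> nrow(best_preds)

But that should leave 14,400 / 9 = 1,600 rows here, one per hold-out prediction, not the 400 I'm after.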
The terms "test" and "testing set" should be reserved for the method of data splitting into training set and testing set. – Agile Bean