How to create a learning curve (bias/variance) from the output of caret::train

Question

I am new to the caret library. I would like to use the train function to run cross-validation on my dataset (using the rpart method for classification). My goal is to produce learning curves from the data returned by my call to train. The learning curve would have the dataset size on the x-axis, with the prediction error on the training and cross-validation sets plotted as a function of dataset size.

My question is, does caret make predictions on both the training and cv folds? If the answer is yes, how would I go about extracting that data?

Assuming the answer is yes, here is a simple code sample that you could extend to illustrate:

library(MASS)
data(biopsy)
biopsy <- biopsy[, -1]
names(biopsy) <- c("thick", "u.size", "u.shape", "adhsn", "s.size", "nucl", "chrom", "n.nuc", "mit", "class")
biopsy.v2 <- na.omit(biopsy)
set.seed(1)
ind <- sample(2, nrow(biopsy.v2), replace = TRUE, prob = c(0.7, 0.3))
biop.train <- biopsy.v2[ind == 1, ]
# trainControl is namespaced too, since caret is never attached with library()
tr.model <- caret::train(class ~ ., data = biop.train, method = "rpart",
                         trControl = caret::trainControl(method = "cv", number = 4,
                                                         verboseIter = FALSE,
                                                         savePredictions = "final"))
# Can I extract train and cv accuracies from tr.model?

Thanks.

Note: I realize that I may need to call train repeatedly with different samples of my dataset (assuming caret doesn't also support this); that is not reflected in the code sample here.

missuse · Accepted Answer · 2017-09-09T21:44:13

You can try this:

A data frame with the held-out predictions for each resample (available because savePredictions is set):

tr.model$pred

A data frame with one column per performance metric and one row per resample:

tr.model$resample

A data frame with the final (selected) tuning parameters:

tr.model$bestTune

A data frame with the resampled performance estimates (averaged over the held-out folds, not computed on the training data) for each value of the tuning parameters:

tr.model$results
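
As a rough sketch of how these components relate, the per-fold accuracy in tr.model$resample can be recomputed by hand from the held-out predictions in tr.model$pred. For brevity this assumes the model is trained on the whole of biopsy.v2 rather than the 70% split from the question; held.out, fold.acc and by.caret are names made up for the illustration.

```r
library(caret)   # the rpart package must also be installed
library(MASS)    # provides the biopsy data

data(biopsy)
biopsy.v2 <- na.omit(biopsy[, -1])
names(biopsy.v2) <- c("thick", "u.size", "u.shape", "adhsn", "s.size",
                      "nucl", "chrom", "n.nuc", "mit", "class")

set.seed(1)
ctrl <- trainControl(method = "cv", number = 4, savePredictions = "final")
tr.model <- train(class ~ ., data = biopsy.v2, method = "rpart", trControl = ctrl)

# Held-out predictions for the selected cp: each row of the data appears once,
# tagged with the fold in which it was held out
held.out <- tr.model$pred
held.out$correct <- as.character(held.out$pred) == as.character(held.out$obs)

# Accuracy per fold, recomputed from the saved predictions
fold.acc <- aggregate(correct ~ Resample, data = held.out, FUN = mean)

# The per-fold accuracy that caret reports, sorted the same way
by.caret <- tr.model$resample[order(tr.model$resample$Resample), ]
print(cbind(fold.acc, caret = by.caret$Accuracy))

# One row per candidate cp, averaged over the folds
print(tr.model$results)
```

The two accuracy columns should agree, which shows that the reported metrics are computed on the held-out rows only.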

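To specify repeated CV, set the method to "repeatedcv" (repeats has no effect with method = "cv"):

trainControl(method = "repeatedcv", number = k, repeats = n)

where k is the number of folds and n is an integer (the number of complete sets of folds to compute)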
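
A minimal sketch of repeated CV on the same data, assuming 4 folds and 3 repeats (both values are arbitrary choices for the example, as are the names ctrl.rep and rep.model):

```r
library(caret)   # the rpart package must also be installed
library(MASS)    # provides the biopsy data

data(biopsy)
biopsy.v2 <- na.omit(biopsy[, -1])
names(biopsy.v2) <- c("thick", "u.size", "u.shape", "adhsn", "s.size",
                      "nucl", "chrom", "n.nuc", "mit", "class")

set.seed(1)
ctrl.rep <- trainControl(method = "repeatedcv", number = 4, repeats = 3,
                         savePredictions = "final")
rep.model <- train(class ~ ., data = biopsy.v2, method = "rpart",
                   trControl = ctrl.rep)

# 4 folds x 3 repeats = 12 resamples, labelled like "Fold1.Rep1"
print(rep.model$resample)

# Mean and spread of the held-out accuracy across all resamples
acc.mean <- mean(rep.model$resample$Accuracy)
acc.sd   <- sd(rep.model$resample$Accuracy)
print(c(mean = acc.mean, sd = acc.sd))
```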

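EDIT: to determine which samples were in the test (held-out) folds:

The relevant information is in the tr.model$pred data frame, where rowIndex is the position of each row in the data passed to train. Selecting columns by name avoids depending on their order:

tr.model$pred[tr.model$pred$Resample == "Fold1", c("rowIndex", "Resample")]
tr.model$pred[tr.model$pred$Resample == "Fold2", c("rowIndex", "Resample")]
tr.model$pred[tr.model$pred$Resample == "Fold3", c("rowIndex", "Resample")]
tr.model$pred[tr.model$pred$Resample == "Fold4", c("rowIndex", "Resample")]

For each fold, the rows that were not held out were used to train that fold's model; tr.model$control$index lists those training rows per fold. Note that caret saves predictions only for the held-out rows, so training-set predictions have to be computed separately with predict.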
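
Putting this together, a learning curve can be sketched by calling train on subsamples of increasing size and recording both errors. This is only one possible approach under stated assumptions: the grid of sizes is arbitrary, the subsamples are drawn from the whole of biopsy.v2, the training error is the resubstitution error of the final model, and the names sizes and curve are made up for the example.

```r
library(caret)   # the rpart package must also be installed
library(MASS)    # provides the biopsy data

data(biopsy)
biopsy.v2 <- na.omit(biopsy[, -1])
names(biopsy.v2) <- c("thick", "u.size", "u.shape", "adhsn", "s.size",
                      "nucl", "chrom", "n.nuc", "mit", "class")

set.seed(1)
ctrl  <- trainControl(method = "cv", number = 4)
sizes <- seq(100, nrow(biopsy.v2), by = 100)   # arbitrary grid of dataset sizes

curve <- do.call(rbind, lapply(sizes, function(n) {
  # Random subsample of n rows
  sub <- biopsy.v2[sample(nrow(biopsy.v2), n), ]
  fit <- train(class ~ ., data = sub, method = "rpart", trControl = ctrl)
  data.frame(
    size      = n,
    # Error of the final model on the same rows it was fitted to
    train.err = 1 - mean(predict(fit, newdata = sub) == sub$class),
    # Cross-validated error at the selected cp (default rule: highest accuracy)
    cv.err    = 1 - max(fit$results$Accuracy, na.rm = TRUE)
  )
}))
print(curve)

# Training and cross-validation error as a function of dataset size
matplot(curve$size, curve[, c("train.err", "cv.err")], type = "b",
        pch = 1:2, lty = 1:2, col = 1:2,
        xlab = "Dataset size", ylab = "Error rate")
legend("topright", legend = c("training", "cross-validation"),
       pch = 1:2, lty = 1:2, col = 1:2)
```

A single subsample per size gives a noisy curve; averaging over several subsamples per size, or using repeated CV, would smooth it.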
