2
votes

I want to perform leave subject out cross validation with R caret (cf. this example) but only use a subset of the data in training for creating CV models. Still, the left out CV partition should be used as a whole, as I need to test on all data of a left out subject (no matter if it's millions of samples that cannot be used in training due to computational restrictions).

I've created a minimal 2 class classification example using the subset and index parameters of caret::train and caret::trainControl to achieve this. From my observation this should solve the problem, but I have a hard time actually ensuring that the evaluation is still done in a leave-subject-out way. Maybe someone with experience in this task could shed some light on this:

library(plyr)
library(caret)
library(pROC)
library(ggplot2)

# with diamonds we want to predict cut and look at results for different colors = subjects
d <- diamonds
d <- d[d$cut %in% c('Premium', 'Ideal'),] # make a 2 class problem
d$cut <- factor(d$cut)
indexes_data <- c(1,5,6,8:10)
indexes_labels <- 2 

# population independent CV indexes for trainControl
index <- llply(unique(d[,3]), function(cls) c(which(d[,3]!=cls)))
names(index) <- paste0('sub_', unique(d[,3]))
str(index) # indexes used for training models with CV = OK

m3 <- train(x = d[,indexes_data], 
            y = d[,indexes_labels], 
            method = 'glm', 
            metric = 'ROC',
            subset = sample(nrow(d), 5000), # does this subset the data used for training and obtaining models, but not the left out partition used for estimating CV performance?
            trControl = trainControl(returnResamp = 'final', 
                                    savePredictions = T,
                                    classProbs = T, 
                                    summaryFunction = twoClassSummary, 
                                    index = index))
str(m3$resample) # all samples used once = OK

# performance over all subjects
myRoc <- roc(predictor = m3$pred[,3], response = m3$pred$obs)

plot(myRoc, main = 'all')

performance for individual subjects

l_ply(unique(m3$pred$Resample), .fun = function(cls) { pred_sub <- m3$pred[m3$pred$Resample==cls,] myRoc <- roc(predictor = pred_sub[,3], response = pred_sub$obs) plot(myRoc, main = cls) } )

Thanks for your time!

1

1 Answers

1
votes

Using both the index and indexOut parameter in caret::trainControl at the same time seems to do the trick (thanks to Max for the hint in this question). Here is the updated code:

library(plyr)
library(caret)
library(pROC)
library(ggplot2)
str(diamonds)

# with diamonds we want to predict cut and look at results for different colors = subjects
d <- diamonds
d <- d[d$cut %in% c('Premium', 'Ideal'),] # make a 2 class problem
d$cut <- factor(d$cut)
indexes_data <- c(1,5,6,8:10)
indexes_labels <- 2 

# population independent CV partitions for training and left out partitions for evaluation
indexes_populationIndependence_subjects <- 3
index <- llply(unique(d[,indexes_populationIndependence_subjects]), function(cls) c(which(d[,indexes_populationIndependence_subjects]!=cls)))
names(index) <- paste0('sub_', unique(d[,indexes_populationIndependence_subjects]))
indexOut <- llply(index, function(part) (1:nrow(d))[-part])
names(indexOut) <- paste0('sub_', unique(d[,indexes_populationIndependence_subjects]))
# subsample partitions for training
index <- llply(index, function(i) sample(i, 1000))

m3 <- train(x = d[,indexes_data], 
            y = d[,indexes_labels], 
            method = 'glm', 
            metric = 'ROC',
            trControl = trainControl(returnResamp = 'final',
                                    savePredictions = T,
                                    classProbs = T, 
                                    summaryFunction = twoClassSummary, 
                                    index = index, 
                                    indexOut = indexOut))
m3$resample # seems OK
str(m3$pred) # seems OK
myRoc <- roc(predictor = m3$pred[,3], response = m3$pred$obs)
plot(myRoc, main = 'all')
# analyze results per subject
l_ply(unique(m3$pred$Resample), .fun = function(cls) {
    pred_sub <- m3$pred[m3$pred$Resample==cls,]
    myRoc <- roc(predictor = pred_sub[,3], response = pred_sub$obs)
    plot(myRoc, main = cls)
} )

Still, I'm not absolutely sure if this is actually does the estimation in a population independent way, so if anybody has knowledge about the details please share your thoughts!