4
votes

I have a question about data preprocessing that I would like clarified. As I understand it, when we tune hyperparameters and estimate model performance via cross-validation, we should not preprocess the whole dataset up front; the preprocessing has to happen inside the cross-validation loop. In other words, in each iteration we estimate the preprocessing parameters on the training folds, then apply those same parameters to the test fold before making predictions.

In the example code below, when I specify preProcess within caret::train, does caret do this automatically? I would really appreciate it if someone could clarify this.

Some online sources preprocess the whole dataset (the training set) first and then use the preprocessed data to tune hyperparameters via cross-validation, which does not seem right to me.

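library(caret)
library(mlbench)   # provides the PimaIndiansDiabetes data set
data(PimaIndiansDiabetes)

# 5-fold cross-validation; keep 4 principal components in the PCA step
control <- trainControl(method = "cv",
                        number = 5,
                        preProcOptions = list(pcaComp = 4))
# Candidate values of mtry to tune over
grid <- expand.grid(mtry = c(1, 2, 3))

# Random forest (method = "rf" needs the randomForest package installed)
model <- train(diabetes ~ ., data = PimaIndiansDiabetes, method = "rf",
               preProcess = c("scale", "center", "pca"),
               trControl = control,
               tuneGrid = grid)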

1 Answer

3
votes

Your concern is well placed: there are many ways to introduce optimistic bias into a performance estimate.

According to Max Kuhn, the creator of caret, there is no data leakage when preProcess is specified in train:

All pre-processing is applied on the resampled version of the data (e.g. 90% in 10-fold CV) and then those calculations are applied to the holdouts (the remaining 10%) with no re-calculation.

source: https://github.com/topepo/caret/issues/335
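
To make the quoted behaviour concrete, here is a minimal sketch of the same idea done by hand. It is only an illustration, not caret's internal code: the names `predictors`, `folds` and `fold_checks` are mine, the seed is arbitrary, and I leave out the model fit so that only the preprocessing step is shown. In each fold, `preProcess` estimates the centering, scaling and PCA rotation from the training rows only, and `predict` then applies those stored values to the holdout rows.

```r
library(caret)
library(mlbench)
data(PimaIndiansDiabetes)

set.seed(1)  # arbitrary seed, only so the fold assignment is repeatable

# Predictor columns only (all numeric in this data set)
predictors <- PimaIndiansDiabetes[, names(PimaIndiansDiabetes) != "diabetes"]

# createFolds returns a list of holdout row indices, one element per fold
folds <- createFolds(PimaIndiansDiabetes$diabetes, k = 5)

fold_checks <- lapply(folds, function(holdout_idx) {
  train_x   <- predictors[-holdout_idx, ]
  holdout_x <- predictors[holdout_idx, ]

  # Estimate means, standard deviations and the PCA rotation
  # from the training folds only
  pp <- preProcess(train_x, method = c("center", "scale", "pca"), pcaComp = 4)

  # Apply the stored parameters to both parts; nothing is re-estimated
  # on the holdout rows
  train_pc   <- predict(pp, train_x)
  holdout_pc <- predict(pp, holdout_x)

  list(n_train   = nrow(train_pc),
       n_holdout = nrow(holdout_pc),
       n_comp    = ncol(holdout_pc))
})
```

The leaky variant would be to call `preProcess` once on all of `predictors` before creating the folds, so that the holdout rows contribute to the means, standard deviations and principal components. Passing `preProcess` to `train` avoids that, per the quote above.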