I have a question about data preprocessing that I would like clarified. As I understand it, when we tune hyperparameters and estimate model performance via cross-validation, the preprocessing should be done inside the cross-validation loop rather than once on the whole dataset. In other words, in each iteration we estimate the preprocessing parameters on the training folds, then apply those same parameters to the held-out fold before making predictions.
In the example code below, when I specify preProcess within caret::train, does it do this automatically? I would really appreciate it if someone could clarify this.
Some online sources preprocess the whole dataset (the training set) first and then use the preprocessed data to tune hyperparameters via cross-validation. That does not seem right to me, because the held-out folds would already have influenced the preprocessing parameters.
library(caret)
library(mlbench)
data(PimaIndiansDiabetes)
control <- trainControl(method = "cv",
                        number = 5,
                        preProcOptions = list(pcaComp = 4))
grid <- expand.grid(mtry = c(1, 2, 3))
model <- train(diabetes ~ ., data = PimaIndiansDiabetes, method = "rf",
               preProcess = c("scale", "center", "pca"),
               trControl = control,
               tuneGrid = grid)
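
To make the procedure I have in mind concrete, here is a minimal sketch of doing the preprocessing by hand inside each fold. It is only an illustration of what I mean, not a claim about what caret::train does internally. The fixed mtry = 2, the seed, and the variable names (folds, fold_acc, pp) are my own assumptions, and it calls randomForest directly, which is the package behind method = "rf".

```r
library(caret)
library(mlbench)
library(randomForest)
data(PimaIndiansDiabetes)

set.seed(1)  # arbitrary seed, only for reproducibility
predictors <- setdiff(names(PimaIndiansDiabetes), "diabetes")

# Each list element holds the row indices of one held-out fold
folds <- createFolds(PimaIndiansDiabetes$diabetes, k = 5)

fold_acc <- sapply(folds, function(idx) {
  train_fold <- PimaIndiansDiabetes[-idx, ]
  test_fold  <- PimaIndiansDiabetes[idx, ]

  # Estimate centering, scaling and PCA on the training folds only
  pp <- preProcess(train_fold[, predictors],
                   method = c("center", "scale", "pca"),
                   pcaComp = 4)

  # Apply the same fitted transformation to both parts
  train_x <- predict(pp, train_fold[, predictors])
  test_x  <- predict(pp, test_fold[, predictors])

  # mtry = 2 is an assumed value; a real search would loop over the grid
  fit <- randomForest(x = train_x, y = train_fold$diabetes, mtry = 2)

  # Accuracy on the held-out fold
  mean(predict(fit, test_x) == test_fold$diabetes)
})

mean(fold_acc)  # cross-validated accuracy estimate
```

The point of the sketch is that preProcess never sees the rows in test_fold, so the reported accuracy is not affected by information from the held-out data.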