PCA preprocess parameter in caret's train function

Question

I am conducting knn regression on my data, and would like to:

a) cross-validate through repeatedcv to find an optimal k;

b) when building the knn model, use PCA with a 90% variance threshold to reduce dimensionality.

library(caret)
library(dplyr)
set.seed(0)
data = cbind(rnorm(20, 100, 10), matrix(rnorm(400, 10, 5), ncol = 20)) %>% 
  data.frame()
colnames(data) = c('True', paste0('Day',1:20))
tr = data[1:15, ] #training set
tt = data[16:20,] #test set

train.control = trainControl(method = "repeatedcv", number = 5, repeats=3)
k = train(True ~ .,
          method     = "knn",
          tuneGrid   = expand.grid(k = 1:10), 
          #trying to find the optimal k from 1:10
          trControl  = train.control, 
          preProcess = c('scale','pca'),
          metric     = "RMSE",
          data       = tr)

My questions:

(1) I noticed that someone suggested changing the PCA parameter in trainControl:

ctrl <- trainControl(preProcOptions = list(thresh = 0.8))
mod <- train(Class ~ ., data = Sonar, method = "pls",
              trControl = ctrl)

If I change the parameter in trainControl, does it mean the PCA is still conducted during the KNN fit? This is a similar concern to this question.

(2) I found another example which fits my situation. I would like to change the threshold to 90%, but I don't know where to set it in caret's train function, especially since I still need the scale option.

I apologize for the long description and scattered references. Thank you in advance!

(Thank you Camille for the suggestions to make the code work!)

Don't have a ton of experience with caret, but it looks like preProcess should be an argument to train, not a function. Change preProcess(c('scale','pca')) to preProcess = c('scale','pca') — camille

StupidWolf · Accepted Answer · 2020-06-15T22:00:18

To answer your questions:

I noticed that someone suggested changing the PCA parameter in trainControl:

mod <- train(Class ~ ., data = Sonar, method = "pls",trControl = ctrl)

If I change the parameter in trainControl, does it mean the PCA is still conducted during the KNN fit?

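Yes. preProcOptions in trainControl only sets the threshold; the PCA still runs because 'pca' is listed in the preProcess argument of train. For a 90% threshold: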

train.control = trainControl(method = "repeatedcv", number = 5, repeats=3,preProcOptions = list(thresh = 0.9))

k = train(True ~ .,
          method     = "knn",
          tuneGrid   = expand.grid(k = 1:10), 
          trControl  = train.control, 
          preProcess = c('scale','pca'),
          metric     = "RMSE",
          data       = tr)

You can check this in the preProcess element of the fitted object:

k$preProcess
Created from 15 samples and 20 variables

Pre-processing:
  - centered (20)
  - ignored (0)
  - principal component signal extraction (20)
  - scaled (20)

PCA needed 9 components to capture 90 percent of the variance
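To see what the threshold is doing, here is a minimal sketch in base R that reproduces the idea with prcomp: scale the predictors, compute the cumulative proportion of variance, and keep the first component count that reaches 90%. This is an illustration of the rule, not caret's internal code; the names pca, cum_var, thresh and n_comp are mine, and caret's exact comparison may differ slightly.

```r
# Rebuild the training data from the question (base R only, no dplyr)
set.seed(0)
data <- data.frame(cbind(rnorm(20, 100, 10), matrix(rnorm(400, 10, 5), ncol = 20)))
colnames(data) <- c("True", paste0("Day", 1:20))
tr <- data[1:15, ]

# PCA on the centered and scaled predictors (outcome column dropped)
pca <- prcomp(tr[, -1], center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained by the leading components
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Assumed rule: smallest number of components whose cumulative variance reaches thresh
thresh <- 0.9
n_comp <- which(cum_var >= thresh)[1]

print(round(cum_var, 3))
print(n_comp)
```

Raising or lowering thresh moves n_comp accordingly, which is what preProcOptions = list(thresh = ...) controls inside train.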

For (2), an alternative is to run preProcess separately and pass thresh to it directly:

mdl = preProcess(tr[,-1],method=c("scale","pca"),thresh=0.9)
mdl
Created from 15 samples and 20 variables

Pre-processing:
  - centered (20)
  - ignored (0)
  - principal component signal extraction (20)
  - scaled (20)

PCA needed 9 components to capture 90 percent of the variance

train.control = trainControl(method = "repeatedcv", number = 5, repeats=3)

k = train(True ~ .,
          method     = "knn",
          tuneGrid   = expand.grid(k = 1:10), 
          trControl  = train.control,
          metric     = "RMSE",
          data       = predict(mdl,tr))
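With this second approach the test set has to go through the same preProcess object before prediction, since train no longer knows about the PCA. The sketch below shows one way to do that; the names tr_pc, tt_pc, fit and pred are illustrative and not from the original post. Keep in mind that here the PCA is estimated once on all 15 training rows rather than inside each resample, so the cross-validated RMSE can look slightly more optimistic than with preProcess passed to train.

```r
library(caret)

# Rebuild the data from the question (base R, same seed)
set.seed(0)
data <- data.frame(cbind(rnorm(20, 100, 10), matrix(rnorm(400, 10, 5), ncol = 20)))
colnames(data) <- c("True", paste0("Day", 1:20))
tr <- data[1:15, ]  # training set
tt <- data[16:20, ] # test set

# Fit scaling + PCA on the training predictors only
mdl <- preProcess(tr[, -1], method = c("scale", "pca"), thresh = 0.9)

# Apply the same transformation to both sets; the True column is carried through
tr_pc <- predict(mdl, tr)
tt_pc <- predict(mdl, tt)

# Tune k on the transformed training data
fit <- train(True ~ .,
             data      = tr_pc,
             method    = "knn",
             tuneGrid  = expand.grid(k = 1:10),
             trControl = trainControl(method = "repeatedcv", number = 5, repeats = 3),
             metric    = "RMSE")

# Predict on the transformed test set
pred <- predict(fit, newdata = tt_pc)
print(pred)
```

When preProcess is instead passed to train, as in the first approach, predict(k, tt) on the raw test set should apply the stored pre-processing automatically.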
