Does the predict function in the caret package use future information when preprocessing?

Question

My question is pretty simple, but I can't find a clear-cut answer in the caret package documentation. If I use the preprocessing options center and scale in my train function, the documentation states that the same preprocessing will be applied to a new data set when making predictions.

So when I use the predict function: does that mean the mean and standard deviation of the training set are applied to the new data? Or are a new centering and scaling computed from the new data set, potentially using points in the future if the data are time series (which is problematic)?

Thank you

Are you talking about caret::predict.preProcess()? If so, the documentation says the transformation uses estimates from the training data to center/scale the test data. — ddunn801
I am talking about predict.train, when you have trained a model and want to use it on a new data set. — mlal

ddunn801 · Accepted Answer · 2016-09-13T14:47:31

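caret::predict.train uses the preprocessing parameters stored in the model you built, that is, the centering and scaling values estimated from the training data, to predict on the test set. It does not compute a new centering or scaling from the new data.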
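To make the distinction concrete, here is a minimal base R sketch of the two behaviours the question contrasts. It is not caret's internal code; the row split and the variable names (train_x, test_x, train_mean, train_sd) are made up for illustration.

```r
# Illustrative only: base R, not caret internals.
data(mtcars)

# Hypothetical split: first 24 rows as "training", last 8 as "new" data.
train_x <- mtcars[1:24, c("disp", "hp")]
test_x  <- mtcars[25:32, c("disp", "hp")]

# Estimate centering and scaling parameters on the training rows only.
train_mean <- colMeans(train_x)
train_sd   <- apply(train_x, 2, sd)

# Behaviour described in this answer: reuse the training parameters on new data.
test_scaled <- scale(test_x, center = train_mean, scale = train_sd)

# The leaky alternative: re-estimate the parameters from the new data itself.
test_rescaled <- scale(test_x)

# Column means of the new data under each approach.
colMeans(test_scaled)    # generally not zero: new data is measured against training statistics
colMeans(test_rescaled)  # zero up to rounding, because the new data was centered on itself
```

Only the second approach lets the new observations influence their own transformation, which is the look-ahead problem raised for time series.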

Here is a snippet from the source code that shows the preProc data comes from the object's preProcess parameters:

out <- predictionFunction(method = object$modelInfo, 
            modelFit = object$finalModel, newdata = newdata, 
            preProc = object$preProcess)

You can see these parameters for yourself after creating your model by accessing object$preProcess. Here is a complete example:

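rm(list=ls())
library(caret)
set.seed(4444)

# Split mtcars into training (about 75%) and testing sets
data(mtcars)
inTrain <- createDataPartition(y=mtcars$mpg,p=0.75,list=FALSE)
training <- mtcars[inTrain,]
testing <- mtcars[-inTrain,]

# Fit a linear model with centering and scaling estimated on the training set
lmFit <- train(mpg~.,data=training,method="lm",preProc=c("center","scale"))

# Inspect the preprocessing object stored with the fitted model
lmFit$preProcess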
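As a further check, the stored values can be compared against statistics computed by hand from the training set. The sketch below repeats the setup so that it runs on its own. It assumes the preProcess object exposes its centers and scales as the mean and std components, and that the stored names match the predictor column names (which should hold here because every mtcars column is numeric).

```r
library(caret)
set.seed(4444)

# Same setup as the example above
data(mtcars)
inTrain  <- createDataPartition(y = mtcars$mpg, p = 0.75, list = FALSE)
training <- mtcars[inTrain, ]
testing  <- mtcars[-inTrain, ]
lmFit <- train(mpg ~ ., data = training, method = "lm",
               preProc = c("center", "scale"))

# Preprocessing parameters stored with the fitted model
pp <- lmFit$preProcess
predictors <- names(pp$mean)

# Compare the stored centers and scales with training-set statistics
mean_check <- all.equal(unname(pp$mean),
                        unname(colMeans(training[, predictors])))
sd_check   <- all.equal(unname(pp$std),
                        unname(apply(training[, predictors], 2, sd)))
mean_check
sd_check

# predict() takes only the fitted object and the new data; the stored
# parameters are the ones applied to testing before the model is evaluated
preds <- predict(lmFit, newdata = testing)
preds
```

If both comparisons come back TRUE, the centering and scaling were estimated from the training rows alone, so nothing from testing (or from future observations in a time series) feeds into the transformation.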
