
My question is pretty simple, but I can't find a clear-cut answer in the caret package documentation. If I use the preprocessing options center and scale in my train function, the documentation states that the same preprocessing will be applied to a new data set when making predictions.

So when I use the predict function, does that mean the mean and standard deviation of the training set are applied to the new data? Or are a new center and scale computed from the new data set, potentially using future points if the data are a time series (which would be problematic)?

Thank you

Are you talking about caret::predict.preProcess()? If so, the documentation says the transformation uses estimates from the training data to center/scale the test data. - ddunn801
I am talking about predict.train, when you have trained a model and want to use it on a new data set. - mlal

1 Answer


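caret::predict.train reuses the preprocessing parameters stored in the model you built, so the centering and scaling values estimated on the training set are what get applied to the test set. Nothing is re-estimated from the new data.

Here is a snippet from the source code showing that the preProc argument is taken from the trained object's preProcess element: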

out <- predictionFunction(method = object$modelInfo, 
            modelFit = object$finalModel, newdata = newdata, 
            preProc = object$preProcess)
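
To make the mechanism concrete, here is a minimal sketch of the same idea done by hand with preProcess and its predict method (the function mentioned in the comments above). The row split and the names train_x, new_x, and by_hand are arbitrary choices for illustration, not part of caret:

```r
library(caret)

data(mtcars)
# Arbitrary illustrative split (an assumption): first 24 rows are "training",
# the remaining rows play the role of new data
train_x <- mtcars[1:24, -1]   # drop mpg, keep the predictors
new_x   <- mtcars[25:32, -1]

# Estimate centering/scaling values from the training predictors only
pp <- preProcess(train_x, method = c("center", "scale"))

# Apply those stored values to the new data
new_scaled <- predict(pp, new_x)

# The same transformation done by hand with the training mean and sd
by_hand <- scale(new_x,
                 center = colMeans(train_x),
                 scale  = apply(train_x, 2, sd))

# TRUE if the new data were transformed with the training statistics
same <- isTRUE(all.equal(as.matrix(new_scaled), by_hand,
                         check.attributes = FALSE))
print(same)
```

If the new data had been centered and scaled with their own statistics instead, the comparison against by_hand would fail.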

You can see these parameters for yourself after creating your model by accessing object$preProcess. Here is a complete example:

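library(caret)
set.seed(4444)

# Split mtcars into training and test sets
data(mtcars)
inTrain <- createDataPartition(y = mtcars$mpg, p = 0.75, list = FALSE)
training <- mtcars[inTrain, ]
testing <- mtcars[-inTrain, ]

# Fit a linear model with centering and scaling of the predictors
lmFit <- train(mpg ~ ., data = training, method = "lm",
               preProcess = c("center", "scale"))

# Inspect the preprocessing object stored with the fitted model
lmFit$preProcess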
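
As a further check, the values stored in the preProcess element can be compared against statistics computed directly from the training set. The sketch below assumes the stored object exposes its centering and scaling values as the mean and std elements, and the names predictors, stored_mean, and so on are only for illustration:

```r
library(caret)
set.seed(4444)

data(mtcars)
inTrain  <- createDataPartition(y = mtcars$mpg, p = 0.75, list = FALSE)
training <- mtcars[inTrain, ]
testing  <- mtcars[-inTrain, ]

lmFit <- train(mpg ~ ., data = training, method = "lm",
               preProcess = c("center", "scale"))

predictors <- setdiff(names(training), "mpg")

# Centering and scaling values stored with the fitted model
stored_mean <- lmFit$preProcess$mean[predictors]
stored_sd   <- lmFit$preProcess$std[predictors]

# Statistics computed directly from the training predictors
train_mean <- colMeans(training[, predictors])
train_sd   <- apply(training[, predictors], 2, sd)

# Both comparisons should hold if the training statistics are what is stored
mean_matches <- isTRUE(all.equal(stored_mean, train_mean, check.attributes = FALSE))
sd_matches   <- isTRUE(all.equal(stored_sd, train_sd, check.attributes = FALSE))
print(c(mean_matches, sd_matches))

# Predictions on the held-out rows reuse those stored values
preds <- predict(lmFit, newdata = testing)
```

For time series, this means no information from the prediction period enters the centering or scaling, provided the training set itself contains only past observations.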