
I have a training data set, let's call it "training_data", which consists of 19 variables (features) and 1 label, for a total of 20 variables (columns). This data set contains only the best predictors: it is the data frame that resulted from feature selection, so low-variance columns and bad predictors have already been removed. Let's call the label in this data set "final_score".

I also have a test data set, let's call it "predictions_data", that has the same 19 variables (features) but no label variable, so it has 19 variables (columns) in total.

I'm building a very simple regression model, using "lasso" regression from caret's "train" method, to train the model and then predict the labels ("final_score") for "predictions_data".

My code goes as follows:

# Load caret, which provides trainControl() and train():

library(caret)

# Convert the training data (a tibble) to a plain data frame:

training_data <- data.frame(training_data)

# Set cross validation folds and times:

fitControl <- trainControl(method = "repeatedcv",
                           number = 3,     # number of folds
                           repeats = 3)    # repeated three times

# Train the model using "lasso" regression via caret's train(). The fitted model is stored as "model.cv":

model.cv <- train(final_score ~ .,
                  data = training_data,
                  method = "lasso",
                  trControl = fitControl,
                  preProcess = c('scale', 'center'))

So far, everything works fine: the model reports the best results from cross-validation and the metrics obtained (RMSE, MAE, etc.).

Now I want to apply the model to "predictions_data", so that it predicts the final_score.

My code for this is:

# Convert the test data set (with no label column) to a data frame:

predictions_data <- data.frame(predictions_data)

# Apply the model with predict() and save the result in an object called "predictions":

predictions <- predict(model.cv, newdata = predictions_data)

And here comes the problem. Even though I stated that newdata = predictions_data, the predictions object returns the predicted labels for the training data set and not for the test data set. What am I doing wrong? (Obviously this is a very basic model, but even so, it should work for predictions, right?)

Thanks in advance!

Hi Jc1919, it looks correct. Can you share your data, using dput(head(predictions_data,100)) and dput(head(training_data,100)), so we can try to reproduce the error? – StupidWolf
I noticed you did predictions_data <- data.frame(predictions_data). Is predictions_data not a data frame to begin with? – StupidWolf
Hi StupidWolf, thanks for commenting! The data.frame(training_data) code was used to convert a tibble object to a data frame (maybe this "downgrade" was unnecessary for this exercise). I found the problem. The predict function was returning the results from the training set (I guess) because the test set had some errors in it (i.e. NAs in numeric columns), unlike the training data set that I had prepared for training. I cleaned and prepared the test data set correctly and it predicted with no trouble! Thanks! – Jc1919
Glad it worked for you this time :) Next time, provide a reproducible example with a subset of the data. It will help you. – StupidWolf
I will for sure, thanks again! – Jc1919

1 Answer


The test data set had some data in an incorrect format (i.e. NAs in numeric columns), unlike the training data set, which had been cleaned and prepared for training. As soon as the test data was cleaned and prepared in the same way, the predict function executed correctly.
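
For anyone hitting something similar, here is a minimal base-R sketch of the kind of checks that would have caught this before calling predict. The toy data frames and the column names x1 and x2 are made up for illustration; only the names training_data, predictions_data and final_score come from the question. It assumes the test set should have the same feature columns and classes as the training set, and it simply drops incomplete rows, which may or may not be the right cleaning step for your data.

```r
# Toy stand-ins for the real data (hypothetical values, for illustration only)
training_data <- data.frame(x1 = c(1.0, 2.0, 3.0, 4.0),
                            x2 = c(0.5, 0.1, 0.9, 0.3),
                            final_score = c(10, 20, 30, 40))
predictions_data <- data.frame(x1 = c(1.5, NA, 3.5),
                               x2 = c(0.2, 0.4, NA))

# 1. Are all training features (label excluded) present in the test set?
feature_names <- setdiff(names(training_data), "final_score")
missing_cols <- setdiff(feature_names, names(predictions_data))

# 2. Do the shared columns have the same class in both sets?
train_classes <- sapply(training_data[feature_names], function(col) class(col)[1])
test_classes <- sapply(predictions_data[feature_names], function(col) class(col)[1])
class_mismatch <- feature_names[train_classes != test_classes]

# 3. How many NAs per column, and which rows are fully observed?
na_per_column <- colSums(is.na(predictions_data))
complete_rows <- complete.cases(predictions_data)

print(na_per_column)

# One possible cleaning step (an assumption): keep only rows without NAs
clean_predictions_data <- predictions_data[complete_rows, , drop = FALSE]
```

After cleaning, predict(model.cv, newdata = clean_predictions_data) should return one prediction per row, so comparing length(predictions) with nrow(clean_predictions_data) is a quick sanity check that the predictions really correspond to the test set.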