
I am having trouble understanding which datasets (training, validation, and test) should be used for the model selection phase versus the final model testing phase. I explain the problem in detail below and post reproducible code at the bottom. Thank you for any and all advice / suggestions!

Let's say we use the open "Life Expectancy (WHO)" dataset available on Kaggle to predict the variable Life expectancy, using RMSE as our measurement of error. (I am asking more about the concepts behind CV here than about achieving the lowest RMSE.) We first partition the original dataset led into a training set led_train and a test set led_test.

Next we fit a linear model with y = Life expectancy and x = GDP on data = led_train, and fit random forest and kNN models on the same data using repeated cross validation from the caret package. We then run predictions on led_test with each of the fitted models. The RMSE is calculated by a function of true vs predicted values.

I now have test-set RMSEs of Linear Model = 9.81141, Random Forest = 9.828415, kNN = 8.923281. Based on these values, I would select the kNN model as my "Final Model"; however, I am not sure how to test it on new "unseen" data to see how well it actually performs.

Do I need to split "led" into 3 sets (training, validation, and test), then use validation for the model selection phase and save test for the "Final Model"? Additionally, if I choose the kNN model, would I change data = led_train to data = led inside the train function so that it is trained on ALL of the data, and then use led_test for the prediction? In the Final Model, would I again set trControl and run cross validation, or is this no longer necessary because it was already done on the training data? Please find my reproducible code posted below (you will have to read in the .csv according to your wd) and thank you again for taking a look!

*The seed is set to 123 for reproducibility and I am running R 3.6.3.

library(pacman)
pacman::p_load(readr, caret, tidyverse, dplyr)

# Download the dataset:
download.file("https://raw.githubusercontent.com/christianmckinnon/StackQ/master/LifeExpectancyData.csv", "LifeExpectancyData.csv")

# Read in the data:
led <- read_csv("LifeExpectancyData.csv")

# Check for NAs
sum(is.na(led))
# Set all NAs to 0
led[is.na(led)] <- 0

# Rename `Life expectancy` to life_exp to avoid using spaces
led <- led %>% rename(life_exp = `Life expectancy`)

# Partition training and test sets
set.seed(123, sample.kind = "Rounding")
test_index <- createDataPartition(y = led$life_exp, times = 1, p = 0.2, list = FALSE)
led_train <- led[-test_index,]
led_test <- led[test_index,]

# Add RMSE as unit of error measurement
RMSE <- function(true_ratings, predicted_ratings){
  sqrt(mean((true_ratings - predicted_ratings)^2))
}

# Create a linear model
led_lm <- lm(life_exp ~ GDP, data = led_train)
# Create prediction
lm_preds <- predict(led_lm, led_test)
# Check RMSE
RMSE(led_test$life_exp, lm_preds)
# The linear Model achieves an RMSE of 9.81141

# Create a Random Forest Model with Repeated Cross Validation
led_cv <- trainControl(method = "repeatedcv", number = 5, repeats = 3,
                      search = "random")
# Set the seed for reproducibility:
set.seed(123, sample.kind = "Rounding")
train_rf <- train(life_exp ~ GDP, data = led_train,
                  method = "rf", ntree = 150, trControl = led_cv,
                  tuneLength = 5, nSamp = 1000, 
                  preProcess = c("center","scale"))
# Create Prediction
rf_preds <- predict(train_rf, led_test)
# Check RMSE
RMSE(led_test$life_exp, rf_preds)
# The rf Model achieves an RMSE of 9.828415

# kNN Model:
knn_cv <- trainControl(method = "repeatedcv", repeats = 1)
# Set the seed for reproducibility:
set.seed(123, sample.kind = "Rounding")
train_knn <- train(life_exp ~ GDP, method = "knn", data = led_train,
                   tuneLength = 10, trControl = knn_cv,
                   preProcess = c("center","scale"))
# Create the Prediction:
knn_preds <- predict(train_knn, led_test)
# Check the RMSE:
RMSE(led_test$life_exp, knn_preds)
# The kNN model achieves the lowest RMSE of 8.923281
This example is not reproducible as led is not defined. - Robert Wilson
@RobertWilson You are absolutely correct -- thank you for pointing this out. I have edited the code to read in the csv and hope that it is now reproducible! - Christian McKinnon

1 Answer


My approach would be the following. The final model should use all of the data. I see no reason to leave data out of the final model; doing so just throws away predictive power.

For model selection, just split the data into training and test data. Compare the candidate modelling methods on the test data, choose the one with the best performance, and then refit that method on all of the data to create the final model.
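A minimal sketch of that workflow, under stated assumptions: `toy` is a hypothetical simulated stand-in for `led` (so the snippet runs without the CSV), only `lm` and `knn` are compared, and the winning method is refit on all rows with its tuned hyperparameters held fixed via `trainControl(method = "none")`. The test RMSE from step 3 is then the only honest performance estimate, since the final refit leaves no held-out data.

```r
library(caret)

# Hypothetical toy data standing in for `led` (simulated, illustration only)
set.seed(123)
toy <- data.frame(GDP = runif(300, 100, 50000))
toy$life_exp <- 50 + 6 * log10(toy$GDP) + rnorm(300, sd = 3)

rmse <- function(truth, pred) sqrt(mean((truth - pred)^2))

# 1. Hold out a test set
test_index <- createDataPartition(toy$life_exp, p = 0.2, list = FALSE)
toy_train <- toy[-test_index, ]
toy_test  <- toy[test_index, ]

# 2. Fit each candidate on the training set; CV here only tunes hyperparameters
cv <- trainControl(method = "cv", number = 5)
fits <- list(
  lm  = train(life_exp ~ GDP, data = toy_train, method = "lm",
              trControl = cv, preProcess = c("center", "scale")),
  knn = train(life_exp ~ GDP, data = toy_train, method = "knn",
              tuneLength = 5, trControl = cv,
              preProcess = c("center", "scale"))
)

# 3. Compare the candidates on the held-out test set
test_rmse <- sapply(fits, function(f) rmse(toy_test$life_exp, predict(f, toy_test)))
best <- names(which.min(test_rmse))

# 4. Refit the winning method on ALL rows, keeping its tuned hyperparameters
final_model <- train(life_exp ~ GDP, data = toy, method = best,
                     tuneGrid = fits[[best]]$bestTune,
                     trControl = trainControl(method = "none"),
                     preProcess = c("center", "scale"))
```

With the real data, `toy` would be replaced by `led`, and `final_model` would be the object used for predictions on genuinely new observations.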

The bigger problem with the current code is that the cross validation method is likely to result in two things: spurious accuracy and potentially spurious model comparisons. You need to deal with temporal autocorrelation in the cross validation. For example, if my training dataset has features for the UK for 2014 and 2016, you would expect something like a random forest to be able to predict life expectancy for 2015 with high accuracy. And that is potentially all you are measuring with the current type of cross validation. It is better to segregate the dataset so that the countries in training and test are different, or to split it into clearly distinct time periods. The exact approach would depend on exactly what you want the model to predict.
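One way to sketch the country-wise split is with caret's `groupKFold`, which builds folds so that all rows sharing a group value stay together. The snippet below is an illustration under assumptions, not a definitive recipe: `toy` is hypothetical simulated data, and the column names `Country` and `Year` are assumed to match the real dataset. A time-based split is shown at the end as the alternative.

```r
library(caret)

# Hypothetical panel data: 20 countries x 16 years (simulated, illustration only)
set.seed(123)
countries <- paste0("country_", 1:20)
toy <- expand.grid(Country = countries, Year = 2000:2015,
                   stringsAsFactors = FALSE)
base_gdp <- setNames(runif(20, 500, 40000), countries)
toy$GDP <- unname(base_gdp[toy$Country]) * (1 + 0.02 * (toy$Year - 2000))
toy$life_exp <- 45 + 7 * log10(toy$GDP) + rnorm(nrow(toy), sd = 1)

# Hold out whole countries, so no test country appears in training
test_countries <- sample(countries, 4)
toy_test  <- toy[toy$Country %in% test_countries, ]
toy_train <- toy[!toy$Country %in% test_countries, ]

# Grouped CV: each fold holds out complete countries, never single years
folds <- groupKFold(toy_train$Country, k = 4)
group_cv <- trainControl(method = "cv", index = folds)

fit <- train(life_exp ~ GDP, data = toy_train, method = "knn",
             tuneLength = 5, trControl = group_cv,
             preProcess = c("center", "scale"))

# Alternative: split by time, training on earlier years only
time_train <- toy[toy$Year <= 2011, ]
time_test  <- toy[toy$Year > 2011, ]
```

Which split to use depends on the prediction target: held-out countries estimate performance on countries the model has never seen, while held-out years estimate performance on future years of known countries.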