Let's say I want to use a Random Forest model to predict future data. I'm thinking about two ways of training the model, picking the best hyperparameters, and putting it into production. The difference between the two approaches is that the first splits the data into a training and test set, while the second does not.
Can I use both of these approaches? Is one better than the other? I suppose one downside of the second approach is that there is no unbiased performance estimate, but does that really matter?
1)
- Split the data into a training and test set (80/20)
- Use k-fold cross-validation on the training set
- Choose the hyperparameters that perform best across the k validation folds
- Train this best model on the complete training set
- Get an unbiased performance estimate on the test set
- Train the best model on the complete data set
- Predict future data using the final model (a code sketch of this workflow follows below)
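Here is a minimal sketch of what I mean by approach 1, assuming scikit-learn with `GridSearchCV` for the k-fold search; the `make_classification` data and the parameter grid are just placeholders, not part of my actual setup:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, random_state=0)  # placeholder data

# 80/20 split into training and test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# k-fold CV on the training set to pick hyperparameters (illustrative grid)
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)  # refit=True retrains the best model on all training data

# unbiased performance estimate on the held-out test set
test_acc = accuracy_score(y_test, search.predict(X_test))
print(f"Test accuracy: {test_acc:.3f}")

# retrain the best configuration on the complete data set for production
final_model = RandomForestClassifier(random_state=0, **search.best_params_)
final_model.fit(X, y)
# final_model.predict(X_future) would then be used on future data
```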
2)
- Use k-fold cross-validation on the complete data set
- Choose the hyperparameters that perform best across the k validation folds
- Train the best model on the complete data set
- Predict future data using the final model (again, a sketch follows below)
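And a corresponding sketch of approach 2, under the same assumptions (scikit-learn, illustrative grid), where the cross-validation and the final refit both use the complete data set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, random_state=0)  # placeholder data

# k-fold CV on the complete data set to pick hyperparameters (illustrative grid)
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)  # best model is refit on the complete data set

final_model = search.best_estimator_
# final_model.predict(X_future) would then be used on future data;
# note there is no held-out set left for an unbiased performance estimate
```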