
Let's say I want to use a Random Forest model to predict future data. I'm thinking about two ways of training this model, picking the best hyperparameters, and putting this model in production. The difference between the two approaches is that the first one splits the data into a training and test set, while the second does not.

Can both of these approaches be used? Is one better than the other? I guess one downside of the second approach is that there is no unbiased performance estimate, but does that really matter?

1)

  • Split data into train and test set (80/20)
  • Use k-fold cross validation on the train data set.
  • Choose hyperparameters which perform best on the k validation sets.
  • Train this best model on complete training data
  • Get an unbiased performance estimate on test set
  • Train best model on complete data set
  • Predict future data using final model
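Here is a minimal sketch of approach 1, assuming scikit-learn and a classification task; the placeholder data, the parameter grid, and the accuracy metric are assumptions, since the question only specifies a Random Forest:

```python
# Approach 1 (sketch, assuming scikit-learn): hold-out test set + k-fold CV on the training set
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # placeholder data

# Split data into train and test set (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# k-fold CV on the training set; GridSearchCV picks the hyperparameters that
# perform best on the k validation folds and refits them on the complete training data
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10]}  # example grid
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)

# Unbiased performance estimate on the held-out test set
print("test accuracy:", accuracy_score(y_test, search.predict(X_test)))

# Train the best model on the complete data set for production
final_model = RandomForestClassifier(random_state=0, **search.best_params_)
final_model.fit(X, y)

# Predict future data using the final model
# final_model.predict(X_future)
```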
2)

  • Use k-fold cross validation on the complete data set.
  • Choose hyperparameters which perform best on the k validation sets.
  • Train best model on complete data
  • Predict future data using final model
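And a matching sketch of approach 2 under the same assumptions (scikit-learn, placeholder data and grid), where cross-validation runs on the complete data set and no hold-out test set is kept:

```python
# Approach 2 (sketch, assuming scikit-learn): k-fold CV on the complete data set, no hold-out test
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, random_state=0)  # placeholder data

# k-fold CV on all data; the best hyperparameters are chosen on the k validation folds
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10]}  # example grid
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

# GridSearchCV refits the best model on the complete data set
final_model = search.best_estimator_

# Note: search.best_score_ is the CV score of the selected hyperparameters and is
# optimistically biased as a performance estimate, since it was used for selection.

# Predict future data using the final model
# final_model.predict(X_future)
```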

1 Answer


A single train/test split is essentially one round of k-fold validation with k = 1/test_fraction: holding out 20% of the data for the test set evaluates the same fraction as one fold of 5-fold cross-validation. So you do not need the extra hold-out split when you already tune the hyperparameters through k-fold cross-validation.
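A tiny illustration of that correspondence (the use of scikit-learn's KFold here is just for demonstration):

```python
# One round of 5-fold CV leaves out 20% of the data, i.e. an 80/20 split (sketch)
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(-1, 1)
train_idx, test_idx = next(KFold(n_splits=5).split(X))  # just the first fold
print(len(test_idx) / len(X))  # 0.2 -> the 20% hold-out of an 80/20 split
```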