Let's say I want to use a Random Forest model to predict future data. I'm thinking about two ways of training the model, picking the best hyperparameters, and putting it into production. The difference between the two approaches is that the first splits the data into a training and test set, while the second does not.
Can I use both of these approaches? Is one better than the other? I suppose one downside of the second approach is that there is no unbiased performance estimate, but does that really matter?
1)
- Split the data into a training and test set (80/20)
- Use k-fold cross-validation on the training set
- Choose the hyperparameters that perform best across the k validation folds
- Train this best model on the complete training set
- Get an unbiased performance estimate on the test set
- Train the best model on the complete data set
- Predict future data using the final model (a code sketch of this workflow follows below)
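Here is a minimal sketch of what I mean by approach 1, assuming scikit-learn with `GridSearchCV` for the k-fold search; the `make_classification` data and the parameter grid are just placeholders, not part of my actual setup:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, random_state=0)  # placeholder data

# 80/20 split into training and test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# k-fold CV on the training set to pick hyperparameters (illustrative grid)
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)  # refit=True retrains the best model on all training data

# unbiased performance estimate on the held-out test set
test_acc = accuracy_score(y_test, search.predict(X_test))
print(f"Test accuracy: {test_acc:.3f}")

# retrain the best configuration on the complete data set for production
final_model = RandomForestClassifier(random_state=0, **search.best_params_)
final_model.fit(X, y)
# final_model.predict(X_future) would then be used on future data
```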
2)
- Use k-fold cross-validation on the complete data set
- Choose the hyperparameters that perform best across the k validation folds
- Train the best model on the complete data set
- Predict future data using the final model (again, a sketch follows below)
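And a corresponding sketch of approach 2, under the same assumptions (scikit-learn, illustrative grid), where the cross-validation and the final refit both use the complete data set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, random_state=0)  # placeholder data

# k-fold CV on the complete data set to pick hyperparameters (illustrative grid)
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)  # best model is refit on the complete data set

final_model = search.best_estimator_
# final_model.predict(X_future) would then be used on future data;
# note there is no held-out set left for an unbiased performance estimate
```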