I am trying to understand the process of model evaluation and validation in machine learning. Specifically, I want to know in which order, and how, the training, validation and test sets should be used.
Let's say I have a dataset and I want to use linear regression with polynomial features. I am undecided among several polynomial degrees (the hyper-parameter).
This Wikipedia article seems to imply that the sequence should be:
- Split data into training set, validation set and test set
- Use the training set to fit the model (find the best parameters: coefficients of the polynomial).
- Afterwards, use the validation set to find the best hyper-parameters (in this case, the polynomial degree). The Wikipedia article says: "Successively, the fitted model is used to predict the responses for the observations in a second dataset called the validation dataset".
- Finally, use the test set to score the model fitted with the training set.
However, this seems strange to me: how can you fit your model with the training set if you haven't yet chosen your hyper-parameters (the polynomial degree in this case)?
I see three alternative approaches, but I am not sure whether they are correct.
First approach
- Split data into training set, validation set and test set
- For each polynomial degree, fit the model with the training set and give it a score using the validation set.
- For the polynomial degree with the best score, fit the model with the training set.
- Evaluate with the test set
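To make this first approach concrete, here is a minimal sketch of what I have in mind. The synthetic data, the 60/20/20 split, the candidate degrees 1 to 6 and the `mse` helper are all my own assumptions for illustration; the fitting uses NumPy's `polyfit` and `polyval`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: a noisy cubic (assumption, for illustration only)
x = rng.uniform(-3, 3, size=300)
y = 0.5 * x**3 - x + rng.normal(scale=1.0, size=x.size)

# Step 1: split into training (60%), validation (20%) and test (20%) sets
idx = rng.permutation(x.size)
train, val, test = idx[:180], idx[180:240], idx[240:]

def mse(coeffs, xs, ys):
    """Mean squared error of the polynomial `coeffs` on (xs, ys)."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

# Step 2: for each degree, fit on the training set, score on the validation set
degrees = range(1, 7)
val_mse = {}
for d in degrees:
    coeffs = np.polyfit(x[train], y[train], deg=d)
    val_mse[d] = mse(coeffs, x[val], y[val])

# Step 3: keep the degree with the lowest validation error, fit on the training set
best_degree = min(val_mse, key=val_mse.get)
best_coeffs = np.polyfit(x[train], y[train], deg=best_degree)

# Step 4: evaluate once on the test set
test_mse = mse(best_coeffs, x[test], y[test])
print(best_degree, test_mse)
```

Written this way, the training set is used once per candidate degree, and the test set is touched only once at the very end.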
Second approach
- Split data into training set, validation set and test set
- For each polynomial degree, use cross-validation only on the validation set to fit and score the model
- For the polynomial degree with the best score, fit the model with the training set.
- Evaluate with the test set
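Here is a minimal sketch of how I picture this second approach. The synthetic data, the split sizes, the 5 folds and the `cv_mse` helper are my own assumptions, not something taken from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: a noisy cubic (assumption, for illustration only)
x = rng.uniform(-3, 3, size=300)
y = 0.5 * x**3 - x + rng.normal(scale=1.0, size=x.size)

# Step 1: split into training, validation and test sets
idx = rng.permutation(x.size)
train, val, test = idx[:180], idx[180:240], idx[240:]

def cv_mse(xs, ys, degree, k=5):
    """Average held-out MSE over k folds, fitting and scoring within (xs, ys) only."""
    folds = np.array_split(np.arange(xs.size), k)
    errors = []
    for i in range(k):
        held_out = folds[i]
        fit_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(xs[fit_idx], ys[fit_idx], deg=degree)
        residuals = np.polyval(coeffs, xs[held_out]) - ys[held_out]
        errors.append(np.mean(residuals ** 2))
    return float(np.mean(errors))

# Step 2: cross-validate each degree using the validation set alone
degrees = range(1, 7)
cv_scores = {d: cv_mse(x[val], y[val], d) for d in degrees}

# Step 3: fit the best degree on the training set
best_degree = min(cv_scores, key=cv_scores.get)
best_coeffs = np.polyfit(x[train], y[train], deg=best_degree)

# Step 4: evaluate once on the test set
test_mse = float(np.mean((np.polyval(best_coeffs, x[test]) - y[test]) ** 2))
print(best_degree, test_mse)
```

Note that in this sketch the training set plays no part in choosing the degree, which is one of the reasons I am unsure about this approach.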
Third approach
- Split data into only two sets: the training/validation set and the test set
- For each polynomial degree, use cross-validation only on the training/validation set to fit and score the model
- For the polynomial degree with the best score, fit the model with the training/validation set.
- Evaluate with the test set
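A minimal sketch of this third approach follows, again under my own assumptions: synthetic data, an 80/20 split, 5 folds, and a hypothetical `cv_mse` helper written with NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: a noisy cubic (assumption, for illustration only)
x = rng.uniform(-3, 3, size=300)
y = 0.5 * x**3 - x + rng.normal(scale=1.0, size=x.size)

# Step 1: split into a training/validation set (80%) and a test set (20%)
idx = rng.permutation(x.size)
trainval, test = idx[:240], idx[240:]
x_tv, y_tv = x[trainval], y[trainval]

def cv_mse(xs, ys, degree, k=5):
    """Average held-out MSE over k folds of (xs, ys)."""
    folds = np.array_split(np.arange(xs.size), k)
    errors = []
    for i in range(k):
        held_out = folds[i]
        fit_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(xs[fit_idx], ys[fit_idx], deg=degree)
        residuals = np.polyval(coeffs, xs[held_out]) - ys[held_out]
        errors.append(np.mean(residuals ** 2))
    return float(np.mean(errors))

# Step 2: cross-validate each degree on the training/validation set
degrees = range(1, 7)
cv_scores = {d: cv_mse(x_tv, y_tv, d) for d in degrees}

# Step 3: refit the best degree on the whole training/validation set
best_degree = min(cv_scores, key=cv_scores.get)
best_coeffs = np.polyfit(x_tv, y_tv, deg=best_degree)

# Step 4: evaluate once on the test set
test_mse = float(np.mean((np.polyval(best_coeffs, x[test]) - y[test]) ** 2))
print(best_degree, test_mse)
```

Here every point outside the test set is used both for fitting and, in a different fold, for scoring.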
So my questions are:
- Is the Wikipedia article wrong, or am I missing something?
- Are the three approaches I envisage correct? Which one would be preferable? Is there another approach better than these three?