Is it correct to test model performance over the entire dataset?

Question

The dataset is divided into training and testing sets using the function train_test_split() in a 75:25 ratio.

The model is trained on x_train and y_train (classifier models like Gaussian naive Bayes, random forest, k-nearest neighbours, etc.).

Can we now test the model using the complete dataset, i.e., x and y? Or should we only use x_test and y_test for testing the model?

you should use only the test data for measuring the generalisation error. — Venkatachalam

Vaidøtas I. · Accepted Answer · 2020-02-27T18:00:31

train_test_split() is a convenience function for creating training and test subsets from your original dataset. x_train holds the training features and y_train the corresponding targets; together they are used to fit a model such as the ones you mention, which is then evaluated on the test subset.

The training data is for training, i.e. practice. Testing on the entire dataset is wrong, because 75% of it is data the model has already seen during training (x_train, y_train), so the score will clearly be biased upward. You should test your models only on never-before-seen data, i.e. x_test and y_test.
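To make the difference concrete, here is a minimal sketch that scores one model three ways: on the training split, on the test split, and on the full dataset. The synthetic data from make_classification, the random forest, and the random_state values are assumptions chosen for illustration; they are not taken from the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the asker's x and y (assumption: illustrative data only).
x, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 75:25 split, as described in the question.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

# Fit on the training split only.
model = RandomForestClassifier(random_state=0)
model.fit(x_train, y_train)

# Score the same fitted model on each subset.
train_acc = accuracy_score(y_train, model.predict(x_train))  # data the model has seen
test_acc = accuracy_score(y_test, model.predict(x_test))     # unseen data
full_acc = accuracy_score(y, model.predict(x))               # mix of seen and unseen

print(f"train: {train_acc:.3f}  test: {test_acc:.3f}  full: {full_acc:.3f}")
```

The full-dataset accuracy is just the size-weighted average of the train and test accuracies, so whenever the model scores higher on its training data than on the test data, the full-dataset figure overstates how well it generalises. Only test_acc estimates performance on new data.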
