Training Data Vs. Test Data

Question

This might sound like an elementary question but I am having a major confusion regarding Training Set and Test.

When we use Supervised learning techniques such as Classification to predict something a common practice is to split the dataset into two parts training and test set. The training set will have a predictor variable, we train the model on the dataset and "predict" things.

Let's take an example. We are going to predict loan defaulters in a bank and we have the German credit data set where we are predicting defaulters and non- defaulters but there is already a definition column which says whether a customer is a defaulter or Non-defaulter.

I understand the logic of prediction on UNSEEN data, like the Titanic survival data but what is the point of prediction where a class is already mentioned, such as German credit lending data.

Lan Lan · Accepted Answer · 2017-09-10T03:02:10

As you said, the idea is to come up a model that you can predict UNSEEN data. The test data is only used to measure the performance of your model created through training data. You want to make sure the model you comes up does not "overfit" your training data. That's why the testing data is important. Eventually, you will use the model to predict whether a new loaner is going to default or not, thus making a business decision whether to approve the loan application.

Training Data Vs. Test Data

3 Answers