Why are data not split in training and testing for unsupervised learning algorithms?

Question

We know that in prediction and classification problems the data can be split according to a training ratio (generally 70-30 or 80-20), where the training data is used to fit a model and the model's output is then tested against the test data.

Let's say I have data with 2 columns:

First column: Employee Age
Second Column: Employee Salary Type

With 100 records similar to this:

Employee Age   Employee Salary Type  

25               low
35               medium
26               low
37               medium
44               high
45               high

If the data is split by the ratio 70:30, let the target variable be Employee Salary Type and the predictor variable be Employee Age.

The data is trained on 70 records and tested against the remaining 30 records while hiding their target variables.

Let's say, 25 out of 30 records have accurate prediction.

Accuracy of the model = (25/30)*100 = 83.33%

Which means the model is good
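
The split-and-score procedure above could be sketched in Python roughly as follows. This is only an illustration: the records are synthetic, the nearest-mean classifier is a stand-in for whatever model is actually used, and the names `make_records`, `fit_class_means` and `predict` are hypothetical.

```python
import random

def make_records(n=100, seed=0):
    """Generate synthetic (age, salary_type) records; purely illustrative data."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        age = rng.randint(22, 60)
        # Assumed rule for the toy data: salary type grows with age, plus noise.
        noisy_age = age + rng.gauss(0, 4)
        if noisy_age < 32:
            label = "low"
        elif noisy_age < 42:
            label = "medium"
        else:
            label = "high"
        records.append((age, label))
    return records

def fit_class_means(train):
    """'Train' by storing the mean age of each salary type."""
    sums, counts = {}, {}
    for age, label in train:
        sums[label] = sums.get(label, 0) + age
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict(class_means, age):
    """Predict the salary type whose mean age is closest to the given age."""
    return min(class_means, key=lambda label: abs(class_means[label] - age))

records = make_records()
random.Random(1).shuffle(records)
train, test = records[:70], records[70:]  # 70:30 split

class_means = fit_class_means(train)
# The test labels are hidden from the model and only used for scoring.
correct = sum(predict(class_means, age) == label for age, label in test)
accuracy = correct / len(test) * 100
print(f"{correct} of {len(test)} correct, accuracy = {accuracy:.2f}%")
```

The score is only computable because every test record carries a known salary type to compare the prediction against.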

Let's apply the same thing to an unsupervised learning task like clustering.

Here there's no target variable; only cluster variables are present.

Let's consider both Employee Age and Employee Salary as cluster variables.

Then data will be automatically clustered according to

Employees with low age and low salary 
Employees with medium age and medium salary 
Employees with high age and high salary

If the training ratio is applied here, we can cluster 70 random records and use the remaining 30 records for testing/validating the above model instead of testing with some other data (and their records). We would fit the model on 70% of the records, fit it again on the remaining 30%, and then compare the characteristics of cluster 1 from the 70% data with the characteristics of cluster 1 from the 30% data. If the characteristics are similar, we can infer that the clustering model is good.

Hence accuracy can be accurately measured here.
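
The comparison proposed above could be sketched as follows. This is a rough sketch under stated assumptions: the employee data is synthetic, salary is assumed to be numeric (in thousands), and `kmeans` is a plain hand-written Lloyd's algorithm rather than any particular library's implementation.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means on tuples of floats; returns the centroids in sorted order."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            groups[nearest].append(p)
        # Recompute each centroid; an empty cluster keeps its old centroid.
        updated = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    # Cluster ids are arbitrary, so sort by age to line up the two runs.
    return sorted(centroids)

def make_employees(n=100, seed=0):
    """Synthetic (age, salary in thousands) points around three made-up groups."""
    rng = random.Random(seed)
    groups = [(25, 30), (36, 60), (45, 90)]
    employees = []
    for _ in range(n):
        age, salary = rng.choice(groups)
        employees.append((rng.gauss(age, 2), rng.gauss(salary, 5)))
    return employees

employees = make_employees()
random.Random(1).shuffle(employees)
part_70, part_30 = employees[:70], employees[70:]

centroids_70 = kmeans(part_70, k=3)
centroids_30 = kmeans(part_30, k=3)
for c70, c30 in zip(centroids_70, centroids_30):
    print(f"70%: ({c70[0]:.1f}, {c70[1]:.1f})   30%: ({c30[0]:.1f}, {c30[1]:.1f})")
```

The centroids have to be sorted before comparing because "cluster 1" in one run need not correspond to "cluster 1" in the other; k-means can also settle in a poor local optimum, so the two runs may disagree for reasons unrelated to the data.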

Why don't people prefer a train/test split for unsupervised analysis like clustering, association rules, forecasting, etc.?

Which people do you refer to in "Why dont people prefer ..."? — dedObed
Not a programming question, hence arguably off-topic here; better suited for Cross Validated. — desertnaut
You sound confused; in your second example the very notion of accuracy does not even exist (let alone be measurable) — desertnaut

Nathan McCoy · Accepted Answer · 2019-07-17T14:16:11

I believe you have a few misconceptions. Here is a quick review:

Review

Unsupervised learning

This is when you have data inputs but no labels, and learn something about the inputs.

Semi-supervised learning

This is when you have data inputs and some labels, and learn something about the inputs and their relationship to the labels.

Supervised learning

This is when you have data inputs and labels, and learn which input maps to which label.

Questions

Now, you mention a few things that don't seem right:

Then data will be automatically clustered according to

Employees with low age and low salary 
Employees with medium age and medium salary 
Employees with high age and high salary

This is only guaranteed if your features represent employees by their age and salary and, since you are using a clustering algorithm, you define a distance metric under which employees with similar age and salary are close to one another.
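
To illustrate why the distance metric matters, here is a small sketch. It assumes salary is recorded as a number, the three rows are made up, and `min_max_scale` is a hypothetical helper written for this example. With raw Euclidean distance the salary difference dominates; after rescaling, age and salary contribute comparably.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def min_max_scale(rows):
    """Rescale every column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

# Made-up (age, yearly salary) rows.
employees = [(25, 30000), (45, 31000), (26, 60000)]

# Raw distances: the salary column swamps the age column.
raw_same_salary = euclidean(employees[0], employees[1])  # 20 years apart, similar pay
raw_same_age = euclidean(employees[0], employees[2])     # 1 year apart, double the pay

# Scaled distances: both columns now live on the same [0, 1] range.
scaled = min_max_scale(employees)
scaled_same_salary = euclidean(scaled[0], scaled[1])
scaled_same_age = euclidean(scaled[0], scaled[2])

print(f"raw:    {raw_same_salary:.1f} vs {raw_same_age:.1f}")
print(f"scaled: {scaled_same_salary:.3f} vs {scaled_same_age:.3f}")
```

Which employees end up "close" therefore depends on how the features are represented, not only on the clustering algorithm.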

You also mention:

If the Training ratio is applied here, 
We can cluster 70 random records and use rest of the 
30 records for testing/validating 
the above model instead of testing with 
some other data (and their records).

Hence accuracy can be accurately measured here.

How do you know the labels? If you are clustering, you would not know what each cluster means, because points are assigned to clusters only by your distance metric. A cluster usually only signifies that its members are closer to one another than to points in other clusters.

You can never know what the correct label is unless you know that a cluster represents a certain label. And if you are using features to cluster and measure distance on, those same features cannot also be used for validation.

This is because you would always get 100% accuracy, since a feature is also a label.
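
The point that cluster ids carry no meaning on their own can be seen in a small sketch. The cluster assignments and labels below are hypothetical, and `naive_accuracy` is an illustrative helper, not a standard metric. The clustering is identical in all six cases; only the id-to-label mapping changes, and choosing that mapping already requires labels.

```python
from itertools import permutations

def naive_accuracy(cluster_ids, labels, id_to_label):
    """Fraction of records whose cluster id maps to their true label."""
    hits = sum(id_to_label[c] == y for c, y in zip(cluster_ids, labels))
    return hits / len(labels)

# Hypothetical clustering output and, for the sake of argument, known labels.
cluster_ids = [0, 0, 1, 1, 2, 2]
labels = ["low", "low", "medium", "medium", "high", "high"]

names = ["low", "medium", "high"]
# Score the same clustering under every possible naming of clusters 0, 1, 2.
scores = {
    mapping: naive_accuracy(cluster_ids, labels, dict(enumerate(mapping)))
    for mapping in permutations(names)
}

for mapping, score in scores.items():
    print(mapping, f"{score:.2f}")
```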

A semi-supervised example

I think your misconception may come from confusing the learning types, so let's work through an example using some fake data.

Let's say you have a table of data with Employee entries like the following:

Employee
  Name
  Age
  Salary
  University degree
  University graduation date
  Address

Now let's say some employees don't want to state their age, since it is not mandatory, but some do. Then you can use a semi-supervised learning approach to cluster employees and get information about their age.

Since we want to get the age, we can approximate by clustering.

Let's make features that represent the Employee age to help us cluster them together:

employee_vector = [salary, graduation, address]

With our input, we are making the claim that age can be determined by salary, graduation date and address, which might be true.

Let's say we have represented all these values numerically, then we can cluster items together.

What would these clusters mean with a standard distance metric such as Euclidean distance?

People with similar salaries, graduation dates and addresses would be clustered together.

Then we could look at the clusters they are in and look at information about the ages we do know.

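# clusters is assumed to be an iterable of (cluster_id, employees) pairs
for cluster_id, employees in clusters:
    # get_known_ages is a hypothetical helper returning only the reported ages
    ages = get_known_ages(employees)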

Now we could use the ages to do lots of operations to guess the missing employee ages, such as fitting a normal distribution or just showing a min/max range.
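
A minimal sketch of the min/max idea, assuming the clustering has already been done. The `clusters` dictionary, the employee records and the helper `estimate_ages` are all hypothetical; `get_known_ages` is given one possible definition here.

```python
from statistics import mean

# Hypothetical clustering result: cluster id -> employees.
# An age of None means the employee did not report it.
clusters = {
    0: [{"name": "A", "age": 24}, {"name": "B", "age": None}, {"name": "C", "age": 27}],
    1: [{"name": "D", "age": 44}, {"name": "E", "age": 47}, {"name": "F", "age": None}],
}

def get_known_ages(employees):
    """Return only the ages that were actually reported."""
    return [e["age"] for e in employees if e["age"] is not None]

def estimate_ages(clusters):
    """Return {name: (min, mean, max)} for every employee with a missing age."""
    estimates = {}
    for cluster_id, employees in clusters.items():
        ages = get_known_ages(employees)
        if not ages:
            continue  # no reported ages in this cluster, so nothing to estimate from
        summary = (min(ages), mean(ages), max(ages))
        for e in employees:
            if e["age"] is None:
                estimates[e["name"]] = summary
    return estimates

estimates = estimate_ages(clusters)
for name, (low, avg, high) in estimates.items():
    print(f"{name}: between {low} and {high}, around {avg}")
```

The result is a range borrowed from cluster neighbours, not a verified age, which is why it cannot be scored for accuracy.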

We could never know what the exact age is, since the clustering does not know that.

We could never test for age, since it is not always known, and is not used in the feature vectors for the employees.

This is why you cannot test a purely unsupervised approach this way: you have no labels.