We know that prediction and classification problems split the data according to a training ratio (generally a 70:30 or 80:20 split), where the training data is used to fit a model and the model's output is evaluated against the test data.
Let's say I have data with 2 columns:
- First column: Employee Age
- Second Column: Employee Salary Type
With 100 records similar to this:
| Employee Age | Employee Salary Type |
|--------------|----------------------|
| 25           | low                  |
| 35           | medium               |
| 26           | low                  |
| 37           | medium               |
| 44           | high                 |
| 45           | high                 |
If the data is split by the ratio 70:30, let the target variable be Employee Salary Type and the predictor variable be Employee Age.
The model is trained on 70 records and tested against the remaining 30 records, whose target variables are hidden.
Let's say 25 out of the 30 test records are predicted correctly.
Accuracy of the model = (25/30) * 100 = 83.33%,
which means the model is good.
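For concreteness, here is a minimal sketch of that supervised workflow using scikit-learn. The synthetic data, thresholds, and the choice of a decision tree are illustrative assumptions, not part of the original example:

```python
# Sketch of the supervised train/test workflow described above.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the 100-record employee table (assumed data)
rng = np.random.default_rng(42)
ages = rng.integers(22, 60, size=100)
salary_type = np.where(ages < 30, "low",
              np.where(ages < 43, "medium", "high"))

X = ages.reshape(-1, 1)   # predictor: Employee Age
y = salary_type           # target: Employee Salary Type

# 70:30 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)
y_pred = model.predict(X_test)

# e.g. 25 correct out of 30 test records -> 83.33% accuracy
print(f"Accuracy: {accuracy_score(y_test, y_pred) * 100:.2f}%")
```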
Let's apply the same idea to an unsupervised learning technique like clustering.
Here there is no target variable; only cluster variables are present.
Let's consider both Employee Age and Employee Salary as cluster variables.
Then the data will be automatically clustered into:
- Employees with low age and low salary
- Employees with medium age and medium salary
- Employees with high age and high salary
If the training ratio is applied here, we can cluster 70 random records and use the remaining 30 records for testing/validating the model, instead of testing with some other data. That is, we fit the clustering model on the 70% of records, fit it again on the remaining 30%, and then compare the characteristics of cluster 1 of the 70% data with the characteristics of cluster 1 of the 30% data (see the sketch below). If the characteristics are similar, we can infer that the clustering model was good.
Hence accuracy can be measured here as well.
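A rough sketch of that comparison, assuming KMeans with k=3 on synthetic age/salary data (all names, the scaling step, and the number of clusters are my assumptions):

```python
# Fit KMeans on 70% of the records, fit it again on the remaining 30%,
# and compare the cluster centres of the two fits.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Synthetic employee data: salary loosely tied to age (assumed)
rng = np.random.default_rng(0)
ages = rng.integers(22, 60, size=100)
salaries = ages * 1000 + rng.normal(0, 3000, size=100)
X = np.column_stack([ages, salaries])

X_train, X_test = train_test_split(X, test_size=0.30, random_state=0)

# Scale both splits with statistics learned on the training portion
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

km_train = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train_s)
km_test = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_test_s)

# Compare cluster centres, sorted by age so the cluster labels line up
centres_train = km_train.cluster_centers_[np.argsort(km_train.cluster_centers_[:, 0])]
centres_test = km_test.cluster_centers_[np.argsort(km_test.cluster_centers_[:, 0])]
print("Train centres:\n", centres_train)
print("Test centres:\n", centres_test)
print("Centre difference:\n", np.abs(centres_train - centres_test))
```

If the two sets of centres (and other cluster characteristics) are close, the clustering looks stable across the split.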
So why don't people prefer a train/test split for unsupervised analysis like clustering, association rules, forecasting, etc.?