Is there a standard methodology for comparing the results (in terms of accuracy) of a classification algorithm against those of a clustering algorithm? My data has only two true labels. It is easy enough to check accuracy when I run a binary classifier on it, but if I run a clustering algorithm and ask it to cluster the data into 5 groups, how can I check the accuracy and compare it to the binary classification? I know clustering is not suitable for two-label data, but how can one prove this mathematically?
2 Answers
Clustering into more than two clusters is one way to do 2-class classification: just pick whichever label is more common in each cluster as the predicted label for that cluster. However, it's a very strange approach, because it ignores the labels until the very end, after the clustering has been computed. Supervised learning (i.e. classification) provides much more powerful tools for this, such as random forests.
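To make the "pick the most common label per cluster" step concrete, here is a minimal sketch in plain Python. The function name `majority_vote_accuracy` and the toy cluster assignments are made up for illustration; in practice the cluster ids would come from whatever clustering algorithm you ran.

```python
from collections import Counter, defaultdict

def majority_vote_accuracy(cluster_ids, true_labels):
    """Map each cluster to its most common true label, then score accuracy."""
    # Collect the true labels that fall into each cluster.
    labels_by_cluster = defaultdict(list)
    for cluster, label in zip(cluster_ids, true_labels):
        labels_by_cluster[cluster].append(label)

    # Predicted label for a cluster = its majority label
    # (ties go to the label seen first in that cluster).
    cluster_to_label = {
        cluster: Counter(labels).most_common(1)[0][0]
        for cluster, labels in labels_by_cluster.items()
    }

    predicted = [cluster_to_label[c] for c in cluster_ids]
    correct = sum(p == t for p, t in zip(predicted, true_labels))
    return correct / len(true_labels), cluster_to_label

# Toy data (hypothetical): 5 clusters, 2 true labels.
cluster_ids = [0, 0, 0, 1, 1, 2, 2, 2, 3, 4]
true_labels = [0, 0, 1, 1, 1, 0, 0, 0, 1, 0]

accuracy, mapping = majority_vote_accuracy(cluster_ids, true_labels)
print(accuracy)  # 0.9
print(mapping)   # {0: 0, 1: 1, 2: 0, 3: 1, 4: 0}
```

This number is usually called cluster purity. Note that it is optimistic: the labels are used to build the mapping on the same points that are then scored, and the value can only stay the same or go up as you add clusters (with one cluster per point it reaches 1.0). For a fair comparison with a classifier, the mapping would have to be fixed on training data and scored on held-out data.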
Don't approach clustering as classification
They have very different objectives, and really should not be compared. Classification is about reproducing known labels, and you need to pay attention to overfitting, train/test splitting, etc. Clustering, on the other hand, is exploratory. Any truly exploratory method will sometimes find nothing, or will turn up only obvious results.
By trying to evaluate it the same way as classification, you "overfit" to clustering methods that yield the obvious, if anything.
Instead, evaluate clustering by looking at the results. If you learn something from the result, then it was good. If not, try again.
Don't try to stick a number on everything
There is more than black, white, and 50 shades of grey. Putting everything into a single number is a greyscale view of the world. It's popular (so is "good vs. evil" thinking), but in science we should do better.