0
votes

Is there a standard methodology for comparing the accuracy of a classification algorithm against that of a clustering algorithm? My data has only two true labels. It is easy enough to check accuracy when I run a binary classifier on it, but if I run a clustering algorithm and ask it for 5 groups, how can I check its accuracy and compare it to the binary classification? I know clustering is not suitable for two-label data, but how can one prove this mathematically?


2 Answers

1
votes

Clustering into more than two clusters is one way to do 2-class classification: just pick whichever label is more common in each cluster as the predicted label for that cluster. However, it's a very strange approach, because it ignores the labels until the very end, after the clustering has been computed. Supervised learning (i.e. classification) provides much more powerful tools for this, like random forests.
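To make the majority-label idea concrete, here is a minimal sketch in plain Python. The function names (`majority_label_map`, `cluster_accuracy`) and the toy data are made up for illustration; the cluster ids are assumed to come from whatever clustering algorithm you ran. Note that the score is optimistic if the mapping is chosen on the same points it is evaluated on, so for a fair comparison with a classifier you would pick the mapping on a training split and score on held-out data.

```python
from collections import Counter, defaultdict


def majority_label_map(cluster_ids, true_labels):
    """Map each cluster id to the most common true label among its members."""
    members = defaultdict(Counter)
    for c, y in zip(cluster_ids, true_labels):
        members[c][y] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in members.items()}


def cluster_accuracy(cluster_ids, true_labels):
    """Accuracy obtained by predicting each cluster's majority label."""
    mapping = majority_label_map(cluster_ids, true_labels)
    correct = sum(mapping[c] == y for c, y in zip(cluster_ids, true_labels))
    return correct / len(true_labels)


# Hypothetical toy data: 10 points, 5 clusters, 2 true labels.
cluster_ids = [0, 0, 0, 1, 1, 2, 2, 2, 3, 4]
true_labels = [1, 1, 0, 0, 0, 1, 1, 1, 0, 1]

print(majority_label_map(cluster_ids, true_labels))
print(cluster_accuracy(cluster_ids, true_labels))
```

The resulting number can be put next to a classifier's accuracy, but keep in mind that it only measures how well the clusters happen to line up with the labels, not how good the clustering is as a clustering.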

0
votes

Don't approach clustering as classification

They have very different objectives, and really should not be compared. Classification is about reproducing known labels, and you need to pay attention to overfitting, train/test splitting, and so on. Clustering, on the other hand, is exploratory. Any truly exploratory method will sometimes find nothing, or turn up only obvious results.

By trying to evaluate it the same way as classification, you "overfit" to clustering methods that yield the obvious, if anything.

Instead, evaluate clustering by looking at the results. If you learn something from the result, then it was good. If not, try again.

Don't try to stick a number on everything

There is more than black, white, and 50 shades of grey. Putting everything into a single number is a greyscale view of the world. It's popular (so is "good vs. evil" thinking), but in science we should do better.