1
votes

I was reading dlib's face clustering code and noticed that the process is like so:

  1. Convert faces to vector using trained network
  2. Use the Chinese whispers clustering algorithm to compute groups based on distance

Chinese whispers clustering can take a pretty long time when trying to cluster a large number (>10,000) of images.

In this pyimagesearch article the author uses DBSCAN, another clustering algorithm, to group a number of images by person.

Since the vectors generated by the neural net can be used to calculate the similarity between two faces, wouldn't it be better to just calculate a Euclidean distance matrix and then search for all the values that meet a confidence threshold (e.g. x < 0.3 for 70% confidence)?

Why use a clustering algorithm at all when you can just compare every face with every other face to determine which ones are the same person? Both DBSCAN and Chinese whispers clustering take much longer than calculating a distance matrix. With my dataset of 30,000 images the times are:

C-whisper - 5 minutes

distance matrix + search - 10-20 seconds
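For reference, the distance-matrix-plus-search approach I have in mind looks something like this (just a sketch: the toy descriptors and the 0.6 threshold are made up, and real descriptors would come from the network):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy stand-ins for the network's face descriptors (real ones are 128-D
# vectors produced by the trained network; these values are made up).
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(5, 128))
descriptors[1] = descriptors[0] + 0.01  # make faces 0 and 1 near-duplicates

# Full Euclidean distance matrix between every pair of faces.
dist = squareform(pdist(descriptors, metric="euclidean"))

# Every pair below the threshold is treated as "same person".
threshold = 0.6
same = dist < threshold
np.fill_diagonal(same, False)  # a face always matches itself; ignore that

matches = np.argwhere(same)  # -> [[0, 1], [1, 0]]
```

The whole thing is a couple of vectorized numpy calls, which is why it runs in seconds rather than minutes.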

2
How would you know what a good threshold is for the distance, though? 0.3 seems pretty arbitrary - why are you saying that gives 70% confidence? — Nathan
@Nathan Well, anyone can pick a distance value; it doesn't really matter. In dlib's example he picks 0.5, which corresponds to a confidence of 0.5. This question was about: why use a clustering algorithm when a distance matrix would work just as well, possibly even better? — A_toaster
It does really matter; that's exactly why I would use a clustering algorithm instead of just checking which faces are close enough to a given face. If you magically knew what distance means faces are similar enough, that works great. But clustering is nice because you don't have to specify this threshold — Nathan
@Nathan Ohhhhhh, okay, well I guess that answers my question! In Chinese whispers clustering you still have to specify a threshold - in dlib's example they use 0.5 - is that somehow different to using a confidence of 0.5? — A_toaster
I'm still not sure what you mean when referring to a confidence. I'm only vaguely familiar with the Chinese whispers clustering algorithm, but I don't know of a confidence value that it returns, and certainly not DBSCAN; they just return clustering assignments. It is true, though, that each clustering algorithm has its own parameters that you need to set, so you don't escape entirely from the problem of specifying thresholds or other parameters of a sort. The hope would be that it's easier to specify the parameters of the clustering algorithm than to give a hard distance cutoff. — Nathan

2 Answers

2
votes

DBSCAN actually takes only marginally longer than computing a distance matrix (when implemented right, 99% of the computation is the distance computations), and with indexing it can sometimes be much faster, because it does not need every pairwise distance if the index can prune computations.
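As a rough illustration, scikit-learn's DBSCAN can be told to use a spatial index so that it only examines neighborhoods instead of all pairs (a sketch: the toy 2-D blobs and the eps value are made up, not face data):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two well-separated blobs of toy 2-D "descriptors" (made-up data).
rng = np.random.default_rng(42)
a = rng.normal(loc=0.0, scale=0.05, size=(20, 2))
b = rng.normal(loc=5.0, scale=0.05, size=(20, 2))
X = np.vstack([a, b])

# algorithm="ball_tree" builds an index, so DBSCAN can prune distance
# computations instead of forming the full pairwise matrix.
labels = DBSCAN(eps=0.3, min_samples=3, algorithm="ball_tree").fit(X).labels_
```

With well-separated blobs like these, DBSCAN recovers the two groups without ever computing most of the cross-blob distances.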

But you can't just "read off" clusters from the distance matrix. The data there may be contradictory: the face detector may consider A and B similar, and B similar to C, but A and C dissimilar! What do you do then? Clustering algorithms try to resolve exactly such situations. For example, single-link, and to a lesser extent DBSCAN, would put A and C into the same cluster, whereas complete linkage would decide upon either AB or BC.
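To see the contradiction concretely, here is a sketch using scipy's hierarchical clustering on three 1-D stand-ins for A, B, and C (the positions and the 0.3 cut are made up for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# A and B are within 0.3 of each other, as are B and C,
# but A and C are 0.5 apart: the "contradictory" situation.
pts = np.array([[0.0], [0.25], [0.5]])  # A, B, C
cut = 0.3

# Single linkage chains A-B-C together into one cluster ...
single = fcluster(linkage(pts, method="single"), t=cut, criterion="distance")

# ... while complete linkage refuses to merge A with C, so one of them
# ends up in its own cluster.
complete = fcluster(linkage(pts, method="complete"), t=cut, criterion="distance")
```

Both results are defensible; the point is that the raw distance matrix alone does not tell you which one to pick.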

1
votes

Actually, dlib's implementation is doing something very similar to what you are thinking of. Here is the code. It first checks every pair and rejects pairs whose distance is greater than a threshold. This is exactly what you proposed. But then it performs a finer clustering on the result. So, what would change?
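A rough sketch of that two-stage idea: threshold the pairwise distances into graph edges, then cluster the graph. Here plain connected components (via union-find) stand in for dlib's Chinese whispers step, and the descriptors and threshold are made up:

```python
import numpy as np

def cluster_by_edges(descriptors, threshold):
    """Stage 1: keep edges between pairs closer than the threshold.
    Stage 2: cluster the resulting graph (connected components here,
    as a simple stand-in for Chinese whispers)."""
    n = len(descriptors)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(descriptors[i] - descriptors[j]) < threshold:
                parent[find(i)] = find(j)  # merge: edge below threshold

    return [find(i) for i in range(n)]

# Made-up 1-D descriptors: faces 0 and 1 are close, face 2 is far away.
labels = cluster_by_edges(
    [np.array([0.0]), np.array([0.1]), np.array([5.0])], threshold=0.5
)
# faces 0 and 1 share a label; face 2 gets its own
```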

Simply cutting off by distances can work if you have well-separated data points. However, if your data points lie close together, the problem becomes much harder. Imagine a 1D feature: your data points sit at the integer positions between 0 and 10, and you want to put two data points in the same cluster if their distance is at most 1.5. So what would you do? If you start with a pair, you could make a cluster. But any neighboring point you pick will be closer than the threshold to one point already in the cluster and farther than the threshold from another. Clustering is about resolving this ambiguity.
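That chaining problem can be sketched in a few lines (using the 1D integer positions and the 1.5 cutoff from the example above):

```python
import numpy as np

points = np.arange(11, dtype=float)  # integer positions 0..10
threshold = 1.5

# Every neighboring pair is within the threshold (distance 1 < 1.5) ...
adjacent_ok = all(
    abs(points[i + 1] - points[i]) < threshold for i in range(len(points) - 1)
)

# ... so a pure distance cutoff chains all 11 points into one group,
# even though the endpoints are distance 10 apart.
endpoints_far = abs(points[-1] - points[0]) > threshold
```

Every point "agrees" with its neighbors but "disagrees" with points two steps away, and the distance matrix alone cannot tell you where to cut the chain.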