1
votes

I was reading dlib's face clustering code and noticed that the process is like so:

  1. Convert faces to vector using trained network
  2. Use the Chinese whispers clustering algorithm to compute groups based on distance

Chinese whispers clustering can take a pretty long time when trying to cluster a large number (>10,000) of images.

In this pyimagesearch article the author uses DBSCAN, another clustering algorithm, to group a number of images by person.

Since the vectors generated by the neural net can be used to calculate the similarity between two faces, wouldn't it be better to just calculate a Euclidean distance matrix and then search for all the values that meet a confidence threshold (e.g. x < 0.3 for 70% confidence)?

Why use a clustering algorithm at all when you can just compare every face with every other face to determine which ones are the same person? Both DBSCAN and Chinese whispers clustering take much longer than calculating a distance matrix. With my dataset of 30,000 images the times are:

C-whisper - 5 minutes

distance matrix + search - 10-20 seconds
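For reference, the distance-matrix-plus-search approach I have in mind looks something like this (just a sketch: the toy descriptors and the 0.6 threshold are made up, and real descriptors would come from the network):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy stand-ins for the network's face descriptors (real ones are 128-D
# vectors produced by the trained network; these values are made up).
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(5, 128))
descriptors[1] = descriptors[0] + 0.01  # make faces 0 and 1 near-duplicates

# Full Euclidean distance matrix between every pair of faces.
dist = squareform(pdist(descriptors, metric="euclidean"))

# Every pair below the threshold is treated as "same person".
threshold = 0.6
same = dist < threshold
np.fill_diagonal(same, False)  # a face always matches itself; ignore that

matches = np.argwhere(same)  # -> [[0, 1], [1, 0]]
```

The whole thing is a couple of vectorized numpy calls, which is why it runs in seconds rather than minutes.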

2
How would you know what a good threshold is for the distance, though? 0.3 seems pretty arbitrary - why are you saying that gives 70% confidence? — Nathan
@Nathan Well, anyone can pick a distance value; it doesn't really matter. In dlib's example he picks 0.5, which corresponds to a confidence of 0.5. This question was about: why use a clustering algorithm when a distance matrix would work just as well, possibly even better? — A_toaster
It does really matter; that's exactly why I would use a clustering algorithm instead of just checking which faces are close enough to a given face. If you magically knew what distance means faces are similar enough, that works great. But clustering is nice because you don't have to specify this threshold — Nathan
@Nathan Ohhhhhh, okay, well I guess that answers my question! In Chinese whispers clustering you still have to specify a threshold - in dlib's example they use 0.5 - is that somehow different to using a confidence of 0.5? — A_toaster
I'm still not sure what you mean when referring to a confidence. I'm only vaguely familiar with the Chinese whispers clustering algorithm, but I don't know of a confidence value that it returns, and certainly not DBSCAN; they just return clustering assignments. It is true, though, that each clustering algorithm has its own parameters that you need to set, so you don't escape entirely from the problem of specifying thresholds or other parameters of a sort. The hope would be that it's easier to specify the parameters of the clustering algorithm than to give a hard distance cutoff. — Nathan

2 Answers

2
votes

DBSCAN actually takes only marginally longer than computing a distance matrix (when implemented right, 99% of the computation is the distance computations), and with indexing it can sometimes be much faster, because it does not need every pairwise distance if the index can prune computations.
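As a rough illustration, scikit-learn's DBSCAN can be told to use a spatial index so that it only examines neighborhoods instead of all pairs (a sketch: the toy 2-D blobs and the eps value are made up, not face data):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two well-separated blobs of toy 2-D "descriptors" (made-up data).
rng = np.random.default_rng(42)
a = rng.normal(loc=0.0, scale=0.05, size=(20, 2))
b = rng.normal(loc=5.0, scale=0.05, size=(20, 2))
X = np.vstack([a, b])

# algorithm="ball_tree" builds an index, so DBSCAN can prune distance
# computations instead of forming the full pairwise matrix.
labels = DBSCAN(eps=0.3, min_samples=3, algorithm="ball_tree").fit(X).labels_
```

With well-separated blobs like these, DBSCAN recovers the two groups without ever computing most of the cross-blob distances.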

But you can't just "read off" clusters from the distance matrix. The data there may be contradictory: the face detector may consider A and B similar, and B similar to C, but A and C dissimilar! What do you do then? Clustering algorithms try to resolve exactly such situations. For example, single-link, and to a lesser extent DBSCAN, would put A and C into the same cluster, whereas complete linkage would decide upon either AB or BC.
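To see the contradiction concretely, here is a sketch using scipy's hierarchical clustering on three 1-D stand-ins for A, B, and C (the positions and the 0.3 cut are made up for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# A and B are within 0.3 of each other, as are B and C,
# but A and C are 0.5 apart: the "contradictory" situation.
pts = np.array([[0.0], [0.25], [0.5]])  # A, B, C
cut = 0.3

# Single linkage chains A-B-C together into one cluster ...
single = fcluster(linkage(pts, method="single"), t=cut, criterion="distance")

# ... while complete linkage refuses to merge A with C, so one of them
# ends up in its own cluster.
complete = fcluster(linkage(pts, method="complete"), t=cut, criterion="distance")
```

Both results are defensible; the point is that the raw distance matrix alone does not tell you which one to pick.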

1
votes

Actually, dlib's implementation is doing something very similar to what you are thinking of. Here is the code. It first checks every pair and rejects pairs whose distance is greater than a threshold. This is exactly what you proposed. But then it performs a finer clustering on the result. So, what would change?
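A rough sketch of that two-stage idea: threshold the pairwise distances into graph edges, then cluster the graph. Here plain connected components (via union-find) stand in for dlib's Chinese whispers step, and the descriptors and threshold are made up:

```python
import numpy as np

def cluster_by_edges(descriptors, threshold):
    """Stage 1: keep edges between pairs closer than the threshold.
    Stage 2: cluster the resulting graph (connected components here,
    as a simple stand-in for Chinese whispers)."""
    n = len(descriptors)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(descriptors[i] - descriptors[j]) < threshold:
                parent[find(i)] = find(j)  # merge: edge below threshold

    return [find(i) for i in range(n)]

# Made-up 1-D descriptors: faces 0 and 1 are close, face 2 is far away.
labels = cluster_by_edges(
    [np.array([0.0]), np.array([0.1]), np.array([5.0])], threshold=0.5
)
# faces 0 and 1 share a label; face 2 gets its own
```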

Simply cutting off by distances can work if you have well-separated data points. However, if your data points lie close together, the problem becomes much harder. Imagine a 1D feature: your data points sit at the integer positions between 0 and 10, and you want to put two data points in the same cluster if their distance is at most 1.5. So what would you do? If you start with a pair, you could make a cluster. But any neighboring point you pick will be closer than the threshold to one point already in the cluster and farther than the threshold from another. Clustering is about resolving this ambiguity.
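That chaining problem can be sketched in a few lines (using the 1D integer positions and the 1.5 cutoff from the example above):

```python
import numpy as np

points = np.arange(11, dtype=float)  # integer positions 0..10
threshold = 1.5

# Every neighboring pair is within the threshold (distance 1 < 1.5) ...
adjacent_ok = all(
    abs(points[i + 1] - points[i]) < threshold for i in range(len(points) - 1)
)

# ... so a pure distance cutoff chains all 11 points into one group,
# even though the endpoints are distance 10 apart.
endpoints_far = abs(points[-1] - points[0]) > threshold
```

Every point "agrees" with its neighbors but "disagrees" with points two steps away, and the distance matrix alone cannot tell you where to cut the chain.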