3 votes

My intent is to cluster document vectors from doc2vec using HDBSCAN. I want to find tiny clusters containing semantic and textual duplicates.

To do this I am using gensim to generate document vectors. The elements of the resulting docvecs are all in the range [-1,1].
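
For context, here is a minimal sketch of how I generate the vectors (the toy corpus and training parameters are just placeholders; in gensim 4.x the trained vectors live in model.dv, in 3.x in model.docvecs):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Toy corpus standing in for the real documents
    docs = ["the cat sat on the mat",
            "a cat was sitting on the mat",
            "stock prices fell sharply today"]
    tagged = [TaggedDocument(words=d.split(), tags=[i]) for i, d in enumerate(docs)]

    # Illustrative parameters only
    model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)
    vectors = [model.dv[i] for i in range(len(docs))]  # one vector per document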

To compare two documents I want to measure their angular similarity. I do this by calculating the cosine similarity of the vectors, which works fine.
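
For concreteness, this is the quantity I mean, in plain numpy (the helper name is mine):

    import numpy as np

    def cosine_sim(u, v):
        # cos(theta) = (u . v) / (||u|| * ||v||)
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))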

But to cluster the documents, HDBSCAN requires a distance matrix, not a similarity matrix. The native conversion from cosine similarity to cosine distance in sklearn is 1 - similarity. However, it is my understanding that this formula can break the triangle inequality, preventing it from being a true distance metric. When searching and looking at other people's code for similar tasks, most people seem to use sklearn.metrics.pairwise.pairwise_distances(data, metric='cosine'), which defines cosine distance as 1 - similarity anyway, and it appears to produce appropriate results.
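
A minimal sketch of that approach with the hdbscan package (random vectors stand in for my docvecs, and min_cluster_size=2 is just an illustrative choice for tiny duplicate clusters):

    import numpy as np
    import hdbscan
    from sklearn.metrics.pairwise import pairwise_distances

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 50))        # stand-in for doc2vec vectors

    # sklearn's cosine distance: 1 - cosine similarity
    dist = pairwise_distances(X, metric='cosine').astype(np.float64)

    clusterer = hdbscan.HDBSCAN(metric='precomputed', min_cluster_size=2)
    labels = clusterer.fit_predict(dist)      # -1 marks noise points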

I am wondering if this is correct, or if I should use the angular distance instead, calculated as np.arccos(cosine similarity)/pi. I have also seen people use Euclidean distance on l2-normalized document vectors; this seems to be equivalent to cosine distance in terms of neighbor ordering.
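
To illustrate the alternative I have in mind (the clipping is mine, to keep arccos numerically safe):

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    rng = np.random.default_rng(0)
    X = rng.standard_normal((10, 50))               # stand-in vectors

    sim = np.clip(cosine_similarity(X), -1.0, 1.0)  # clip so arccos stays defined
    angular_dist = np.arccos(sim) / np.pi           # angular distance in [0, 1]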

Please let me know what the most appropriate method is for calculating distances between document vectors for clustering :)


2 Answers

1 vote

I believe in practice cosine-distance is used, despite the fact that there are corner-cases where it's not a proper metric.

You mention that "elements of the resulting docvecs are all in the range [-1,1]". That isn't usually guaranteed to be the case – though it would be if you've already unit-normalized all the raw doc-vectors.

If you have done that unit-normalization, or want to, then after such normalization euclidean-distance will always give the same ranked-order of nearest-neighbors as cosine-distance. The absolute values, and relative proportions between them, will vary a little – but all "X is closer to Y than Z" tests will be identical to those based on cosine-distance. So clustering quality should be nearly identical to using cosine-distance directly.
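
To illustrate, here is a quick check with random unit-normalized stand-in vectors:

    import numpy as np
    from sklearn.metrics.pairwise import pairwise_distances
    from sklearn.preprocessing import normalize

    rng = np.random.default_rng(0)
    X = normalize(rng.standard_normal((20, 50)))  # unit-normalized stand-ins

    cos_d = pairwise_distances(X, metric='cosine')
    euc_d = pairwise_distances(X, metric='euclidean')

    # For unit vectors ||u - v||^2 = 2 - 2*cos(u, v) = 2 * cosine_distance,
    # so euclidean = sqrt(2 * cosine_distance), a monotone transformation.
    assert np.allclose(euc_d, np.sqrt(2.0 * np.clip(cos_d, 0.0, None)))

    # Monotone map => identical nearest-neighbor choices under both distances.
    np.fill_diagonal(cos_d, np.inf)
    np.fill_diagonal(euc_d, np.inf)
    assert (cos_d.argmin(axis=1) == euc_d.argmin(axis=1)).all()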

1 vote

The proper similarity metric is the dot product, not cosine.

Word2vec etc. are trained using the dot product, not normalized by the vector length, and you should use exactly what was trained.

People use the cosine all the time because it worked well for bag-of-words. As far as I know, the choice is not based on a proper theoretical analysis.

HDBSCAN does not require a metric. The 1 - sim transformation assumes that the similarity is bounded by 1, which does not hold for the raw dot product, so that won't reliably work.

I suggest trying the following approaches:

  • use negative distances, i.e., d(x,y) = -(x · y). That may simply work.
  • use the max-sim transformation: once you have the dot-product matrix, it is easy to find the maximum value, and then use d(x,y) = max_sim - sim(x,y) (sketched below)
  • implement HDBSCAN* with a similarity rather than a metric
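
A sketch of the max-sim variant, assuming the hdbscan package (random vectors stand in for real doc-vectors):

    import numpy as np
    import hdbscan

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 50))   # stand-in for raw (unnormalized) doc vectors

    sim = X @ X.T                        # dot-product similarity matrix

    # Shift by the global maximum so all "distances" are non-negative;
    # 1 - sim would fail here because the dot product is not bounded by 1.
    dist = (sim.max() - sim).astype(np.float64)
    np.fill_diagonal(dist, 0.0)          # treat self-distance as zero

    clusterer = hdbscan.HDBSCAN(metric='precomputed', min_cluster_size=2)
    labels = clusterer.fit_predict(dist)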