My intent is to cluster document vectors from doc2vec using HDBSCAN. I want to find tiny clusters of semantic and textual duplicates.
To do this, I am using gensim to generate document vectors. The elements of the resulting docvecs are all in the range [-1, 1].
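Roughly, my setup looks like this (a minimal sketch: `documents` stands in for my tokenized corpus, the hyperparameters are just illustrative, and `model.dv` is spelled `model.docvecs` in gensim versions before 4.0):

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# `documents` is a placeholder: an iterable of token lists, one per document.
tagged = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(documents)]
model = Doc2Vec(tagged, vector_size=100, min_count=2, epochs=40)

# Stack one vector per document into an (n_docs, vector_size) array.
vectors = np.vstack([model.dv[i] for i in range(len(tagged))])
```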
To compare two documents, I want to measure their angular similarity. I do this by calculating the cosine similarity of the vectors, which works fine.
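That is, for two docvecs `a` and `b`:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two document vectors.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```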
But to cluster the documents, HDBSCAN requires a distance matrix, not a similarity matrix. The native conversion from cosine similarity to cosine distance in sklearn is `1 - similarity`. However, it is my understanding that using this formula can break the triangle inequality, preventing it from being a true distance metric. When searching and looking at other people's code for similar tasks, it seems that most people use `sklearn.metrics.pairwise.pairwise_distances(data, metric='cosine')`, which defines cosine distance as `1 - similarity` anyway, and it looks like it provides appropriate results.
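In other words, the pattern I keep seeing boils down to this (a sketch reusing `vectors` from above; `min_cluster_size=2` because I am hunting for tiny duplicate clusters):

```python
import hdbscan
from sklearn.metrics.pairwise import pairwise_distances

# Cosine distance matrix: entry (i, j) is 1 - cosine_similarity(doc_i, doc_j).
distance_matrix = pairwise_distances(vectors, metric='cosine')

# HDBSCAN can consume a precomputed distance matrix; it expects float64 input.
clusterer = hdbscan.HDBSCAN(metric='precomputed', min_cluster_size=2)
labels = clusterer.fit_predict(distance_matrix.astype('float64'))
```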
I am wondering if this is correct, or if I should use angular distance instead, calculated as `np.arccos(cosine_similarity) / np.pi`.
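Concretely, that would be something like (again just a sketch over the same `vectors`):

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

# Recover cosine similarity from sklearn's cosine distance, then take the
# angle; the clip guards against floating-point values just outside [-1, 1].
# The result is in [0, 1] and, unlike 1 - similarity, satisfies the
# triangle inequality.
cosine_sim = 1.0 - pairwise_distances(vectors, metric='cosine')
angular_dist = np.arccos(np.clip(cosine_sim, -1.0, 1.0)) / np.pi
```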
I have also seen people use Euclidean distance on L2-normalized document vectors; on unit vectors the Euclidean distance is `sqrt(2 - 2 * cosine_similarity)`, so this seems to be a monotonic transform of cosine distance rather than a genuinely different measure.
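That variant would look roughly like this:

```python
import hdbscan
from sklearn.preprocessing import normalize

# L2-normalize each row, then cluster with plain Euclidean distance. On unit
# vectors, euclidean = sqrt(2 - 2 * cosine_similarity), so distance orderings
# match those of cosine distance.
unit_vectors = normalize(vectors, norm='l2')
labels = hdbscan.HDBSCAN(metric='euclidean', min_cluster_size=2).fit_predict(unit_vectors)
```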
Please let me know which of these is the most appropriate method for calculating distances between document vectors for this clustering task :)