
I am using pre-trained fastText (https://fasttext.cc/) vectors to cluster short chat messages. Each message is represented as the average of the vectors of its tokens.
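As a minimal sketch of that averaging step (with a toy token-to-vector dictionary standing in for the real pre-trained fastText vectors):

```python
import numpy as np

# Toy stand-in for pre-trained fastText vectors: token -> 4-dim vector.
# In practice these would be loaded from the real model files.
vectors = {
    "hello": np.array([0.1, 0.3, -0.2, 0.5]),
    "there": np.array([0.0, 0.4, 0.1, -0.1]),
    "world": np.array([0.2, -0.1, 0.3, 0.0]),
}

def message_vector(message, vectors, dim=4):
    """Represent a message as the mean of its known tokens' vectors."""
    tokens = [t for t in message.lower().split() if t in vectors]
    if not tokens:
        return np.zeros(dim)  # no known tokens: fall back to the zero vector
    return np.mean([vectors[t] for t in tokens], axis=0)

v = message_vector("hello world", vectors)
```

Out-of-vocabulary tokens are simply skipped here; real fastText can also build vectors for unseen words from character n-grams.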

I started with k-means, but I am now wondering whether it is the right choice. For instance, k-means uses Euclidean distance, while word-embedding similarity is usually measured with cosine similarity.

How do I choose the right clustering method in this case?


1 Answer


Interestingly, the length of word2vec vectors seems to correspond to the "significance" of a word, while the angle corresponds to its meaning, so the answer depends on which of the two matters most for your use case.

https://stats.stackexchange.com/questions/177905/should-i-normalize-word2vecs-word-vectors-before-using-them

If the vectors are normalized to unit length, Euclidean and cosine distance become equivalent: for unit vectors, squared Euclidean distance is 2 * (1 - cosine similarity), so both induce the same neighbour ranking.
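This equivalence is easy to check numerically; a quick sketch with random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = rng.normal(size=300)

# Normalize both vectors to unit length.
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cos_sim = a @ b                      # cosine similarity of unit vectors
sq_euclidean = np.sum((a - b) ** 2)  # squared Euclidean distance

# For unit vectors: ||a - b||^2 == 2 * (1 - cos(a, b))
assert np.isclose(sq_euclidean, 2 * (1 - cos_sim))
```

This is why a common workaround is to L2-normalize the embeddings first and then run ordinary k-means on them.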

You may want to try Annoy (built by Spotify engineering); it lets you build a nearest-neighbour index under different distance measures, including an angular (cosine-like) metric: https://github.com/spotify/annoy