1
votes

I am trying to cluster a number of words using the KMeans algorithm from scikit-learn.

In particular, I use pre-trained word embeddings (300-dimensional vectors) to map each word to a numeric vector. I then feed these vectors to KMeans and specify the number of clusters.

My issue is that certain words in my input corpus are not in the pre-trained word embeddings dictionary. In these cases, instead of a vector, I get a numpy array full of NaN values. KMeans cannot handle these, so I have to exclude those arrays. However, I would like to see all the words that were not found in the embeddings and, if possible, put them in a separate cluster that contains only them.
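For reference, the rows with missing embeddings can be picked out with a NaN mask. This is only a minimal sketch: it assumes the lookups have been stacked into one 2-D array, and the names `words` and `vectors` as well as the toy data are made up for illustration.

```python
import numpy as np

# Hypothetical toy data: 4-dimensional vectors stand in for the 300-dimensional ones.
words = ["cat", "dog", "qwzx", "car", "blorp"]
vectors = np.array([
    [0.1, 0.2, -0.3, 0.4],
    [0.2, 0.1, -0.2, 0.5],
    [np.nan, np.nan, np.nan, np.nan],  # word not found in the embeddings
    [-0.5, 0.3, 0.8, -0.1],
    [np.nan, np.nan, np.nan, np.nan],  # word not found in the embeddings
])

# True for every row that contains at least one NaN.
missing_mask = np.isnan(vectors).any(axis=1)

# Split the words into those without and those with an embedding.
missing_words = [w for w, m in zip(words, missing_mask) if m]
known_words = [w for w, m in zip(words, missing_mask) if not m]
known_vectors = vectors[~missing_mask]

print(missing_words)
print(known_vectors.shape)
```

The mask keeps the original row order, so it can be reused later to map cluster labels back to the full word list.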

My idea at this point is to add a condition: if the embeddings index returns a NaN array for a word, assign an arbitrary vector to it instead. Each dimension of an embedding vector lies within [-1,1]. Therefore, if I assign the vector [100000]*300 to all NaN words, I create a set of outliers. In practice this works as expected, since these vectors are forced into a separate cluster. However, the initialization of the KMeans centroids is affected by these outlier values, so all the rest of my clusters get messed up as well. As a remedy, I tried initializing KMeans with init='k-means++', but first, it takes significantly longer to execute, and second, the improvement is small.

Any suggestions as to how to approach this issue?

Thank you.

1
Why are you using 100000 as the value when everything else is between -1 and 1? This will massively skew the KMeans clustering. I would still fill with an arbitrary vector, but make it all 0, 1 or -1. That should not affect the clustering as much and ideally still lets you cluster the unknown words together. Or try one of the other clustering methods in sklearn; they may handle outliers better. - Ken Syme

1 Answer

0
votes

If you don't have data on a word, then skip it.

You could try to compute a word vector on the fly based on the context, but that essentially is the same as just skipping it.
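Skipping the unknown words during clustering does not mean losing track of them. A minimal sketch, assuming the vectors are stacked in one array with NaN rows for the unknown words: fit KMeans on the known rows only and reserve one extra label for the rest. The toy data, the variable names and the choice of `n_clusters` as the reserved label are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical toy data: two well-separated groups of known vectors in [-1, 1],
# plus three rows of NaN standing in for out-of-vocabulary words.
group_a = rng.uniform(0.5, 1.0, size=(10, 4))
group_b = rng.uniform(-1.0, -0.5, size=(10, 4))
unknown = np.full((3, 4), np.nan)
vectors = np.vstack([group_a, unknown, group_b])

n_clusters = 2
missing_mask = np.isnan(vectors).any(axis=1)

# Fit KMeans on the known vectors only, so the centroids never see the NaN rows.
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
known_labels = kmeans.fit_predict(vectors[~missing_mask])

# Reserve one extra label (n_clusters) for the words without an embedding.
labels = np.full(len(vectors), n_clusters, dtype=int)
labels[~missing_mask] = known_labels

print(labels)
```

Because the unknown words never enter the fit, they cannot pull the centroids or disturb the initialization, yet they still end up grouped under their own label.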