I am trying to cluster a number of words using the KMeans algorithm from scikit-learn.
In particular, I use pre-trained word embeddings (300-dimensional vectors) to map each word to a numeric vector, and then I feed these vectors to KMeans together with the number of clusters.
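For context, the setup looks roughly like this (a minimal sketch; `embeddings_index`, the word list and the cluster count are placeholders, with random values standing in for the real pre-trained vectors):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
dim = 300

# Stand-in for the pre-trained embeddings: word -> 300-dimensional vector
# (random values here just to keep the sketch runnable)
embeddings_index = {w: rng.uniform(-1, 1, dim)
                    for w in ["apple", "banana", "cherry", "car", "bus", "train"]}

words = list(embeddings_index)
X = np.array([embeddings_index[w] for w in words])  # shape (n_words, 300)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
```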
My issue is that certain words in my input corpus cannot be found in the pre-trained embeddings dictionary. For these words, instead of a vector, I get a numpy array full of nan values. KMeans cannot handle such arrays, so I have to exclude them. However, I am interested in seeing all the words that were not found in the embeddings and, if possible, throwing them into a separate cluster that contains only them.
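This is roughly how the missing words show up and how I currently exclude them (again a sketch with a made-up `embeddings_index`; the all-NaN check is just my way of flagging a failed lookup):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 300

# Stand-in embeddings; the real index is loaded from the pre-trained file
embeddings_index = {w: rng.uniform(-1, 1, dim) for w in ["apple", "banana", "car"]}

corpus = ["apple", "banana", "car", "qwertyish"]  # "qwertyish" has no embedding

# Failed lookups come back as all-NaN vectors
X = np.array([embeddings_index.get(w, np.full(dim, np.nan)) for w in corpus])

# The words I would like to inspect (and ideally isolate in their own cluster)
missing = [w for w, row in zip(corpus, X) if np.isnan(row).all()]
print(missing)  # ['qwertyish']

# Current workaround: exclude the NaN rows before clustering
X_clean = X[~np.isnan(X).any(axis=1)]
```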
My idea at this point is to set a condition: if a word comes back from the embeddings index as a nan-valued array, assign an arbitrary vector to it. Each dimension of the embedding vectors lies within [-1, 1], so if I assign the vector [100000]*300 to all the nan words, I have created a set of outliers. In practice this works as expected, since these vectors are forced into a separate cluster. However, the initialization of the KMeans centroids is affected by the outlier values, so all the rest of my clusters get messed up as well. As a remedy, I tried initializing KMeans with init='k-means++', but first, it takes significantly longer to execute, and second, the results are not much better.
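For completeness, here is a sketch of that workaround (placeholder names and cluster count; the outlier vector is the [100000]*300 trick described above):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
dim = 300

embeddings_index = {w: rng.uniform(-1, 1, dim) for w in ["apple", "banana", "car", "bus"]}
corpus = ["apple", "banana", "car", "bus", "qwertyish"]  # last word has no embedding

# Assign a far-away constant vector to every word without an embedding,
# so that all such words land together in one "outlier" cluster
outlier_vector = np.full(dim, 100000.0)
X = np.array([embeddings_index.get(w, outlier_vector) for w in corpus])

# k-means++ initialization; in my experience the huge outlier values
# still skew the centroids and mess up the remaining clusters
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
```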
Any suggestions as to how to approach this issue?
Thank you.