1
votes

I have a set of document vectors generated using gensim doc2vec (~500K vectors of 150 dimensions). I wish to cluster similar documents, for which I want to generate an n*n similarity matrix over which I can run my clustering algorithm.

I tried the instructions from this link https://github.com/RaRe-Technologies/gensim/issues/140 using gensim.similarities, but the output for 500k records was a 500k*150 matrix. I don't understand the output. Shouldn't it be 500k*500k? Am I missing something?
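For reference, a shape check along these lines shows the same output (simplified; gensim 4.x attribute names and the model path are placeholders):

```python
from gensim.models.doc2vec import Doc2Vec

# Load the trained model (placeholder path).
model = Doc2Vec.load("doc2vec.model")

# One 150-dimensional vector per document.
print(model.dv.vectors.shape)  # -> (500000, 150), not (500000, 500000)
```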


1 Answer

3
votes

That is the embedding you are looking at: one 150-dimensional vector per document.
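To make that concrete, here is a minimal sketch (plain NumPy and scikit-learn, with random stand-in data) contrasting the embedding matrix you got with the pairwise similarity matrix you expected, on a deliberately small n:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Random stand-in for doc2vec output: n documents, 150 dimensions each.
n, dim = 1000, 150
embeddings = np.random.rand(n, dim).astype(np.float32)

# A true pairwise similarity matrix has one entry per document pair.
sim = cosine_similarity(embeddings)

print(embeddings.shape)  # (1000, 150)  <- the shape you are seeing
print(sim.shape)         # (1000, 1000) <- the n*n shape you expected
```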

No, you do not want to compute a similarity matrix.

Did you do the math? 500k x 500k x 8 bytes per double, halved because the matrix is symmetric. Do you have enough main memory (around 1 TB) for this matrix? How long would it take to compute? What clustering algorithm do you mean to run on it next, and how long will that take?
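Spelled out as a quick check:

```python
n = 500_000            # number of documents
bytes_per_double = 8   # one float64 entry

full = n * n * bytes_per_double  # dense n*n matrix
half = full // 2                 # symmetric: keep only the upper triangle

print(full / 1e12)  # 2.0 -> 2 TB for the full matrix
print(half / 1e12)  # 1.0 -> still 1 TB for half of it
```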

Start with a smaller sample first, and find a working approach. Then estimate how long it would take to scale to your entire data set. Don't scale up first, just to find out that you have no idea what you are doing.
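A sketch of that workflow, with cosine distance and agglomerative clustering as placeholder choices (scikit-learn >= 1.2 parameter names; the 10k sample size and 50 clusters are arbitrary numbers you would tune on the sample):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_distances

# Random stand-in for a 10k subsample of your 500k doc2vec vectors;
# in practice, slice random rows out of your real embedding matrix.
rng = np.random.default_rng(0)
sample = rng.random((10_000, 150)).astype(np.float32)

# A 10k x 10k distance matrix fits in memory (a few hundred MB),
# unlike the terabyte-scale matrix for all 500k documents.
dist = cosine_distances(sample)

# Cluster on the precomputed distances; "average" linkage because
# "ward" does not accept precomputed matrices.
clusterer = AgglomerativeClustering(
    n_clusters=50, metric="precomputed", linkage="average"
)
labels = clusterer.fit_predict(dist)

print(np.bincount(labels)[:10])  # sanity-check the cluster sizes
```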