
I'm training my Doc2Vec model on 106k documents (100-600 words per document). The goal is to retrieve similar documents for a target document.

Since Doc2Vec is an unsupervised model, the only real way to evaluate it is to test how it performs on the downstream task. So I created a small test dataset of about 200 target documents, each with 5 similar documents.

My idea is to calculate the cosine similarity of every document against all other documents in my test dataset and get the top 5 most similar documents per target document.

Is there an efficient way to create a cosine similarity matrix with Doc2Vec? The most_similar function is impractical here because it searches all documents used for training, not just the ones in my test dataset.


1 Answer


You could use sklearn's cosine_similarity function for this. Once you have the list of 200 vectors, convert it to a NumPy array and pass it to this function; it returns the pairwise similarity matrix. You can then use argsort() to get the indices of the closest documents. For top-k matching on a single row arr of that matrix, use arr.argsort()[-k:][::-1]. Note that every document has similarity 1 with itself, so the first hit in each row is normally the document itself and should be skipped.
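A minimal sketch of the above, under some assumptions: it uses plain NumPy instead of sklearn (normalising the rows and taking dot products should give the same matrix as cosine_similarity), the helper name top_k_similar is made up for illustration, and random vectors stand in for the real Doc2Vec vectors.

```python
import numpy as np

def top_k_similar(vectors, k=5):
    """Hypothetical helper: for each row, return the indices and cosine
    scores of the k most similar other rows, best match first."""
    X = np.asarray(vectors, dtype=float)
    # Normalise rows to unit length so a dot product equals cosine similarity.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.clip(norms, 1e-12, None)
    sim = X @ X.T  # pairwise cosine similarity matrix, shape (n, n)
    # Mask the diagonal so a document is never returned as its own match.
    np.fill_diagonal(sim, -np.inf)
    # Sort each row ascending, keep the last k columns, reverse to descending.
    idx = np.argsort(sim, axis=1)[:, -k:][:, ::-1]
    scores = np.take_along_axis(sim, idx, axis=1)
    return idx, scores

# Stand-in data (assumption): random vectors in place of real Doc2Vec vectors.
rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(200, 100))
top_idx, top_scores = top_k_similar(doc_vectors, k=5)
```

With gensim, the rows of doc_vectors could come from model.infer_vector(tokens) for each test document, or from the stored vectors of documents that were part of training. Row i of top_idx then holds the positions of the 5 nearest test documents, which can be compared against the 5 expected ones.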