I'm training my Doc2Vec model on 106k documents (100-600 words per document). The goal is to retrieve similar documents for a target document.
Since Doc2Vec is an unsupervised model, there is no real way to evaluate it other than testing how it performs on the downstream task. So I created a small test dataset containing about 200 target documents and 5 similar documents per target.
My idea is to calculate the cosine similarity of every document against all other documents in my test dataset and take the top 5 most similar documents for each target document.
Is there an efficient way to create a cosine similarity matrix with Doc2Vec? The most_similar function is impractical here, as it returns similar documents from the entire training set rather than only from my test dataset.
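
To make the idea concrete, here is a minimal sketch of the computation I have in mind, using plain NumPy. It assumes the vectors of the test documents are already stacked into one array with one row per document. The random data below is only a stand-in for those vectors, and the helper names cosine_similarity_matrix and top_k_similar are hypothetical, not part of gensim.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Return the pairwise cosine similarity matrix for the rows of `vectors`."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # avoid division by zero for all-zero rows
    unit = vectors / norms     # L2-normalise every row
    return unit @ unit.T       # dot products of unit vectors are cosines


def top_k_similar(sim, k=5):
    """For each row, return the indices of the k most similar other rows, best first."""
    sim = sim.copy()
    np.fill_diagonal(sim, -np.inf)  # a document must not match itself
    return np.argsort(-sim, axis=1)[:, :k]


# Stand-in data (assumption): 12 "documents" with 8-dimensional vectors.
# With a trained gensim model, each row would instead be the vector of one
# test document, e.g. obtained via model.infer_vector(tokens).
rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(12, 8))

sim = cosine_similarity_matrix(doc_vectors)
top5 = top_k_similar(sim, k=5)
```

Row i of top5 then holds the indices of the 5 documents most similar to document i, which could be compared against the 5 labelled similar documents for each target. For a test set of roughly 1,200 documents the full matrix is small enough to hold in memory.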