
I have around 20k documents of 60-150 words each. For 400 of these 20k documents, the similar documents are already known; these 400 documents serve as my test data.

I am trying to find similar documents for these 400 test documents using gensim doc2vec. The paper "Distributed Representations of Sentences and Documents" says that "The combination of PV-DM and PV-DBOW often work consistently better (7.42% in IMDB) and therefore recommended."

So I would like to combine the vectors from these two methods, compute the cosine similarity against all the training documents, and select the top 5 with the smallest cosine distance.

So what is an effective way to combine the vectors from these two methods: adding, averaging, or something else?

After combining the two vectors, I can normalise each combined vector and then compute the cosine distance.
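For concreteness, here is a minimal sketch of that search step. The names `dbow_vec`, `dm_vec`, and `train_matrix` are placeholders I'm assuming, not anything from gensim: `train_matrix` is assumed to hold the already-combined, unit-normalised vectors of the 20k training documents, one per row.

```python
import numpy as np

def combine(dbow_vec, dm_vec, how="concat"):
    """Combine the two paragraph vectors. Concatenation is what the
    answer below suggests; adding/averaging are shown for comparison."""
    if how == "concat":
        return np.concatenate([dbow_vec, dm_vec])
    if how == "add":
        return dbow_vec + dm_vec
    if how == "average":
        return (dbow_vec + dm_vec) / 2.0
    raise ValueError(f"unknown combination method: {how}")

def top_k_similar(query_vec, train_matrix, k=5):
    """Indices of the k training documents with the smallest cosine
    distance, assuming the rows of train_matrix are unit-normalised."""
    q = query_vec / np.linalg.norm(query_vec)  # unit-normalise the query
    sims = train_matrix @ q                    # dot product = cosine similarity here
    return np.argsort(-sims)[:k]
```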


1 Answer


The paper implies they've concatenated the vectors from the two methods: for example, given a 300d PV-DBOW vector and a 300d PV-DM vector, you'd get a 600d vector for each text after concatenation.
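A hedged sketch of that, assuming gensim 4.x (where per-document vectors live in `model.dv`; older releases used `model.docvecs`). The toy corpus and hyperparameter values are placeholders, not values from the question:

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy stand-in for the real 20k documents.
raw_texts = ["first toy document text", "second toy document text"]
corpus = [TaggedDocument(words=text.split(), tags=[i])
          for i, text in enumerate(raw_texts)]

# Train the two modes separately: dm=0 is PV-DBOW, dm=1 is PV-DM.
dbow = Doc2Vec(corpus, dm=0, vector_size=300, min_count=1, epochs=20)
dm = Doc2Vec(corpus, dm=1, vector_size=300, min_count=1, epochs=20)

# 600d concatenated vector for the document tagged 0.
vec_600d = np.concatenate([dbow.dv[0], dm.dv[0]])
```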

However, note that their bottom-line results on IMDB have been hard for outsiders to reproduce, and my tests have only sometimes shown a small advantage for these concatenated vectors. (In particular, I wonder whether 300d PV-DBOW + 300d PV-DM via separate concatenated models is any better than simply training a single true 600d model of either mode, for the same amount of time, with fewer steps and complications.)

You can view my demonstration of repeating some of the experiments from the original 'Paragraph Vector' paper in one of the example notebooks included with gensim, in its docs/notebooks directory:

https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb

It includes, among other things, a few steps and helper methods for treating pairs of models as a concatenated whole; a minimal version of that idea is sketched below.
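This is not gensim's own class, just a home-grown wrapper in the spirit of the `ConcatenatedDoc2Vec` helper that notebook used (it lived in `gensim.test.test_doc2vec` in older releases). It assumes the wrapped models were trained on the same tagged corpus:

```python
import numpy as np

class ConcatenatedDocvecs:
    """Look up a tag in each wrapped model and return the concatenated vector."""
    def __init__(self, models):
        self.models = models

    def __getitem__(self, tag):
        return np.concatenate([m.dv[tag] for m in self.models])

# usage, given the dbow and dm models trained above:
# pair = ConcatenatedDocvecs([dbow, dm])
# pair[0]  # -> 600d vector for the document tagged 0
```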