I have around 20k documents of 60 to 150 words each. For 400 of these 20k documents, the similar documents are already known. These 400 documents serve as my test data.
I am trying to find similar documents for these 400 test documents using gensim doc2vec. The paper "Distributed Representations of Sentences and Documents" says that "The combination of PV-DM and PV-DBOW often work consistently better (7.42% in IMDB) and therefore recommended."
So I would like to combine the vectors from these two methods, compute the cosine similarity against all the training documents, and select the top 5 with the smallest cosine distance.
What is the most effective way to combine the vectors from these 2 methods: adding, averaging, or some other method?
After combining these 2 vectors I can normalise each vector and then find the cosine distance.
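
To make the question concrete, here is a minimal sketch of the pipeline I have in mind. It uses NumPy with random placeholder arrays standing in for the real gensim output, so `train_dm`, `train_dbow`, `test_dm`, `test_dbow` and the helper names `combine` and `top_k_similar` are all hypothetical. Concatenation is included only as one example of "any other method"; I am not assuming any of the three is the right one.

```python
import numpy as np

def combine(dm_vecs, dbow_vecs, how="concat"):
    """Combine PV-DM and PV-DBOW vectors (one row per document), then L2-normalise."""
    if how == "concat":
        combined = np.hstack([dm_vecs, dbow_vecs])   # dimensions are stacked side by side
    elif how == "sum":
        combined = dm_vecs + dbow_vecs               # requires equal vector sizes
    elif how == "mean":
        combined = (dm_vecs + dbow_vecs) / 2.0       # requires equal vector sizes
    else:
        raise ValueError(f"unknown method: {how}")
    # After row-wise L2 normalisation, a dot product equals cosine similarity.
    norms = np.linalg.norm(combined, axis=1, keepdims=True)
    return combined / np.clip(norms, 1e-12, None)

def top_k_similar(query_vecs, train_vecs, k=5):
    """Return, for each query row, the indices of the k most similar training rows."""
    sims = query_vecs @ train_vecs.T                 # cosine similarity matrix
    return np.argsort(-sims, axis=1)[:, :k]          # highest similarity first

# Placeholder data (assumption): 1000 training docs, 100-dimensional vectors per model.
rng = np.random.default_rng(0)
train_dm = rng.normal(size=(1000, 100))
train_dbow = rng.normal(size=(1000, 100))
# Four query docs built as slightly perturbed copies of training docs 0..3.
test_dm = train_dm[:4] + 0.01 * rng.normal(size=(4, 100))
test_dbow = train_dbow[:4] + 0.01 * rng.normal(size=(4, 100))

train_vecs = combine(train_dm, train_dbow, how="concat")
test_vecs = combine(test_dm, test_dbow, how="concat")
top5 = top_k_similar(test_vecs, train_vecs, k=5)
print(top5)
```

One thing I did notice: since the vectors are normalised afterwards, adding and averaging differ only by a constant factor, so `how="sum"` and `how="mean"` produce the same cosine ranking. The real choice is therefore between summing and something like concatenation. Also, because my 400 test documents are part of the 20k, each query would match itself first and would need to be excluded from its own results.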