
I came across a few blog posts stating that document vectors can be generated not only by Doc2Vec, but also by averaging the word vectors obtained by running the Word2Vec algorithm. In that case, would the vectors generated by the two algorithms be the same? Which would be the most effective way to generate document vectors, and why?

Any reference links on this would be of great help!

Thanks in advance.


1 Answer


Those are two different methods of creating a vector for a set-of-words.

The vectors will be in different positions, and of different quality.

Averaging is quite fast, especially if you've already got word-vectors. But it's a very simple approach that won't capture many shades of meaning – indeed it is completely oblivious to word ordering/relative proximities, and the act of averaging can tend to 'cancel out' contrasting meanings in the text.
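As a minimal sketch of the averaging approach (the word vectors here are made-up toy values; in practice they would come from a trained Word2Vec model):

```python
import numpy as np

# Toy "pre-trained" word vectors; values are illustrative only.
word_vectors = {
    "fast":  np.array([0.9, 0.1, 0.0]),
    "quick": np.array([0.8, 0.2, 0.1]),
    "slow":  np.array([-0.9, 0.0, 0.1]),
}

def average_vector(tokens, vectors):
    """Average the vectors of known tokens, ignoring out-of-vocabulary words."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # No known words: return a zero vector of the right dimensionality.
        return np.zeros(len(next(iter(vectors.values()))))
    return np.mean(known, axis=0)

# Word order is irrelevant, and "unknownword" is simply skipped.
doc_vec = average_vector(["fast", "quick", "unknownword"], word_vectors)

# Contrasting words tend to cancel: "fast" and "slow" average toward zero
# in the first dimension.
cancelled = average_vector(["fast", "slow"], word_vectors)
```

Note how `cancelled[0]` ends up at 0.0 even though both words were strongly (oppositely) weighted there, which is the 'cancelling out' effect described above.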

Doc2Vec instead trains vectors for full texts in a manner very similar to word-vectors (and often alongside word-vectors). Essentially, a pretend-word that's assigned to the text 'floats' alongside the word-vector training, as if it were 'near' all the other word-vector training (for that one text). It's a slightly more sophisticated approach, but as it uses a very similar algorithm (and model complexity) on the same data, results on many downstream evaluations are often similar.

To obtain summary text-vectors capturing more subtle shades of meaning, as implied by grammatical rules and more advanced language usage, can require yet-more-sophisticated methods, such as those employing larger deep networks.

There's no single most efficient approach, as all real uses depend a lot on the type, quantity, and quality of your texts, and your intended uses of the vectors.