word2vec - what is best? add, concatenate or average word vectors?

Question

I am working on a recurrent language model. To learn word embeddings that can be used to initialize my language model, I am using gensim's word2vec model. After training, the word2vec model holds two vectors for each word in the vocabulary: the word embedding (rows of input/hidden matrix) and the context embedding (columns of hidden/output matrix).

As outlined in this post there are at least three common ways to combine these two embedding vectors:

summing the context and word vector for each word
summing & averaging
concatenating the context and word vector

However, I couldn't find proper papers or reports on the best strategy. So my questions are:

Is there a common solution whether to sum, average or concatenate the vectors?
Or does the best way depend entirely on the task in question? If so, what strategy is best for a word-level language model?
Why combine the vectors at all? Why not use the "original" word embeddings for each word, i.e. those contained in the weight matrix between input and hidden neurons.

Related (but unanswered) questions:

You might want to add what you are trying to do, e.g. build a sentence or paragraph level vector. (Gensim for example offers doc2vec for that) — de1
I want to initialize my recurrent language model with the word embeddings produced by gensim. So my goal is to learn an embedding for each word in my vocabulary. After training the word2vec model, I can use the original embeddings or modify them further (as outlined in the post). I want to know which strategy yields the "best" word embeddings — Lemon
In the first post you linked, the question is about creating a sentence vector. i.e. combine the word vectors to a single vector representing the sentence (or paragraph). That is where the question about how to combine the vectors seems to be most relevant. Is that what you want to do? — de1
Not sure whether I understand your question. I am building a language model that is fed with sequential words and trained to predict the next word in a sentence. Each input word is mapped to an embedding. I use gensim to learn these word embeddings. My goal is to get the best possible word embeddings. — Lemon
Okay, then it doesn't sound like you are trying to do that. As far as I know, the combination of vectors you referred to are used to create a single vector out of a number of vectors. Not to improve the word vectors themselves. But perhaps someone else knows better. To get better vectors you could obviously look into the training data, size of the embedding or alternative methods such as GloVe. Also including the type of word within sentence could potentially improve the vector (see Sense2Vec). — de1

Lemon Lemon · Accepted Answer · 2018-01-18T11:51:26

I have found an answer in the Stanford lecture "Deep Learning for Natural Language Processing" (Lecture 2, March 2016). It's available here. In minute 46 Richard Socher states that the common way is to average the two word vectors.

word2vec - what is best? add, concatenate or average word vectors?

4 Answers