In word2vec, training produces two weight matrices: (1) the input-hidden weight matrix and (2) the hidden-output weight matrix. People then use the input-hidden weight matrix as the word vectors (each row corresponds to one word, i.e., that word's vector). Here are my confusions:
- Why do people use the input-hidden weight matrix as the word vectors instead of the hidden-output weight matrix?
- Why don't we apply the softmax activation to the hidden layer rather than the output layer, which would avoid the expensive softmax over the whole vocabulary?
Also, any clarifying remarks on the intuition behind obtaining word vectors this way would be appreciated.
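For concreteness, here is a minimal NumPy sketch of the setup I am describing (toy dimensions, skip-gram style; all names such as `W_in`/`W_out` are my own, not from any library):

```python
import numpy as np

# Toy dimensions: vocabulary of 5 words, 3-dimensional embeddings.
V, d = 5, 3
rng = np.random.default_rng(0)

W_in = rng.normal(size=(V, d))   # input-hidden weights: row i = vector of word i
W_out = rng.normal(size=(d, V))  # hidden-output weights: column j = context vector of word j

def forward(word_idx):
    """Skip-gram forward pass for one center word."""
    h = W_in[word_idx]                    # hidden layer = plain row lookup (linear, no activation)
    scores = h @ W_out                    # one score per vocabulary word
    exp = np.exp(scores - scores.max())   # numerically stable softmax
    return exp / exp.sum()                # probabilities over the whole vocabulary

probs = forward(2)
print(probs.shape)  # (5,)

# The "word vectors" people export are simply the rows of W_in:
vec_of_word_2 = W_in[2]
```

The hidden layer here is just an embedding lookup with no nonlinearity; the softmax only appears at the output, where it normalizes over all `V` words, which is exactly the step that makes training expensive.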