In word2vec, training produces two weight matrices: (1) the input-hidden weight matrix and (2) the hidden-output weight matrix. People then use the input-hidden weight matrix as the word vectors (each row corresponds to one word, i.e., that word's vector). Here are my confusions:
- Why do people use the input-hidden weight matrix as the word vectors instead of the hidden-output weight matrix?
- Why don't we apply the softmax activation to the hidden layer rather than the output layer, which would avoid the expensive softmax over the whole vocabulary?
Also, any clarifying remarks on the intuition behind obtaining word vectors this way would be appreciated.
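For concreteness, here is a minimal NumPy sketch of the setup I am describing (toy dimensions, skip-gram style; all names such as `W_in`/`W_out` are my own, not from any library):

```python
import numpy as np

# Toy dimensions: vocabulary of 5 words, 3-dimensional embeddings.
V, d = 5, 3
rng = np.random.default_rng(0)

W_in = rng.normal(size=(V, d))   # input-hidden weights: row i = vector of word i
W_out = rng.normal(size=(d, V))  # hidden-output weights: column j = context vector of word j

def forward(word_idx):
    """Skip-gram forward pass for one center word."""
    h = W_in[word_idx]                    # hidden layer = plain row lookup (linear, no activation)
    scores = h @ W_out                    # one score per vocabulary word
    exp = np.exp(scores - scores.max())   # numerically stable softmax
    return exp / exp.sum()                # probabilities over the whole vocabulary

probs = forward(2)
print(probs.shape)  # (5,)

# The "word vectors" people export are simply the rows of W_in:
vec_of_word_2 = W_in[2]
```

The hidden layer here is just an embedding lookup with no nonlinearity; the softmax only appears at the output, where it normalizes over all `V` words, which is exactly the step that makes training expensive.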