After weeks of searching, I've found the answer. This could be useful to anyone interested in understanding Word2Vec (and word embeddings in general) as opposed to just using it.
When training the neural network, the input is a one-hot vector of size |V| (the vocabulary size), and the output can be a concatenation (or average) of one-hot vectors, which is your context. In the middle sits the hidden layer with d units, so the input-to-hidden weight matrix is a |V| x d matrix. Each row in that matrix is the word embedding corresponding to the non-zero unit of your one-hot input vector.
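In case it helps, here is roughly what that looks like in NumPy. This is just a sketch of the forward pass, assuming the skip-gram variant predicting a single context word; the sizes and names like `W_in` / `W_out` are purely illustrative, not anyone's official implementation:

```python
import numpy as np

V, d = 10_000, 300                     # vocabulary size |V| and embedding dimension d (made-up values)
W_in = np.random.randn(V, d) * 0.01    # input-to-hidden weights: one row per word = the embeddings
W_out = np.random.randn(d, V) * 0.01   # hidden-to-output weights

def forward(word_index: int) -> np.ndarray:
    """Forward pass for one centre word; returns a probability distribution over the vocab."""
    one_hot = np.zeros(V)
    one_hot[word_index] = 1.0
    hidden = one_hot @ W_in            # equals W_in[word_index]: that word's embedding
    scores = hidden @ W_out            # one score per vocabulary word
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()         # softmax over the context-word distribution
```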
For example, if a word is encoded as the one-hot vector [0, 0, 1, 0], it is fed into your neural network transposed (as a column vector). Notice only one unit is non-zero, so only that input unit fires into the hidden units. That means the 3rd row of the matrix is the only one we care about, hence your word embedding is just a row of that matrix.
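Here is a tiny toy check of that row-selection claim, using the same 4-word example (the matrix values are made up; only the shapes and the lookup matter):

```python
import numpy as np

W = np.arange(12, dtype=float).reshape(4, 3)   # |V| = 4 words, d = 3 dimensions
one_hot = np.array([0.0, 0.0, 1.0, 0.0])       # the example one-hot vector

hidden = one_hot @ W        # multiplying by a one-hot vector just picks out a row
print(hidden)               # [6. 7. 8.]
print(W[2])                 # [6. 7. 8.] -- the 3rd row IS the word's embedding
assert np.allclose(hidden, W[2])
```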
I hope that helps anyone interested (maybe I'm the only one?)