
I'm using Google's Word2vec and I'm wondering how to get the top words that are predicted by a skipgram model that is trained using hierarchical softmax, given an input word?

For instance, when using negative sampling, one can simply multiply an input word's embedding (from the input matrix) with each of the vectors in the output matrix and take the one with the top value. However, in hierarchical softmax, there are multiple output vectors that correspond to each input word, due to the use of the Huffman tree.
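For concreteness, the negative-sampling scoring described above can be sketched roughly as follows. This is a toy illustration in plain Python, not code from any word2vec implementation: the function name `top_words_by_dot`, the two-dimensional vectors, and the three-word vocabulary are all made up for the example.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def top_words_by_dot(input_vec, output_vecs, topn=3):
    """Rank candidate output words by the dot product of the input word's
    embedding with each word's output vector (hypothetical helper)."""
    scores = {word: dot(input_vec, vec) for word, vec in output_vecs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]

# Toy data (assumed): one input embedding and one output vector per word.
input_vec = [1.0, 0.5]
output_vecs = {
    "cat": [0.9, 0.1],
    "dog": [0.2, 0.4],
    "car": [-0.5, 0.3],
}

top = top_words_by_dot(input_vec, output_vecs)
ranked_words = [word for word, _ in top]
```

With negative sampling there is exactly one output vector per word, which is what makes this ranking straightforward.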

How do we compute the likelihood value/probability of an output word given an input word in this case?


1 Answer


I haven't seen any way to do this, and given the way hierarchical-softmax (HS) outputs work, there's no obviously correct way to turn the output nodes' activation levels into a precise per-word likelihood estimation. Note that:

  • gensim's predict_output_word() method, which (sort of) simulates a negative-sampling prediction, doesn't even try to handle HS mode

  • during training, neither HS nor negative-sampling modes make exact predictions – they just nudge the outputs to be more like the current training example would require

To the extent you could calculate all output node activations for a given context, then check the node values along each word's unique HS code path for how close they are to "predicting" that word, you could potentially synthesize a relative score for each word: some measure of how far the values are from a "certain" output of that word. But whether and how each node's deviation should contribute to that score, and how far that score can be read as an interpretable likelihood, is unclear.
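One candidate for such a score is the product of per-node sigmoid decisions along a word's Huffman path, which is how the usual HS formulation defines a word's probability. The sketch below is a toy in plain Python under stated assumptions: the names `hs_word_probability`, `points`, and `codes` are hypothetical, the tree and vectors are invented, and the sign convention (code bit 0 means the sigmoid should be high) is assumed to mirror the training update in word2vec.c.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def hs_word_probability(input_vec, node_vecs, points, codes):
    """Multiply the binary decisions along one word's path.
    points: indices of the inner nodes on the path, root first.
    codes:  the branch bit taken at each of those nodes.
    Assumed convention: bit 0 -> sigmoid(f), bit 1 -> sigmoid(-f)."""
    prob = 1.0
    for node, bit in zip(points, codes):
        f = dot(input_vec, node_vecs[node])
        sign = 1.0 if bit == 0 else -1.0
        prob *= sigmoid(sign * f)
    return prob

# Toy tree (assumed): inner node 0 is the root, inner node 1 is its child.
node_vecs = [[0.5, -0.2], [-0.3, 0.8]]
tree = {
    "the": ([0], [0]),
    "cat": ([0, 1], [1, 0]),
    "sat": ([0, 1], [1, 1]),
}
input_vec = [1.0, 0.5]

probs = {w: hs_word_probability(input_vec, node_vecs, p, c)
         for w, (p, c) in tree.items()}
ranked = sorted(probs, key=probs.get, reverse=True)
```

Because every inner node splits its probability mass between its two branches, these scores sum to 1 over the vocabulary. Whether the values from a trained model are well calibrated is a separate question, and ranking the whole vocabulary this way requires walking every word's path.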

There could also be issues because HS codes are assigned strictly by word frequency, so 'neighbor' words sharing mostly the same encoding may be very different semantically. (There were some hints in the original word2vec.c code that it could be beneficial to assign HS encodings by clustering related words to have similar codings, rather than by strict frequency, but I've seen little practice of that since.)

I would suggest sticking to negative sampling if interpretable predictions are important. (But also remember that word2vec isn't mainly used for predictions; it just uses the training attempts at prediction to bootstrap a vector arrangement that turns out to be useful for other tasks.)