6
votes

My current doc2vec code is as follows.

# Train doc2vec model
model = doc2vec.Doc2Vec(docs, size = 100, window = 300, min_count = 1, workers = 4, iter = 20)

I also have a word2vec code as below.

 # Train word2vec model
model = word2vec.Word2Vec(sentences, size=300, sample = 1e-3, sg=1, iter = 20)

I am interested in using both DM and DBOW in doc2vec AND both Skip-gram and CBOW in word2vec.

In Gensim I found the below mentioned sentence: "Produce word vectors with deep learning via word2vec’s “skip-gram and CBOW models”, using either hierarchical softmax or negative sampling"

Thus, I am confused either to use hierarchical softmax or negative sampling. Please let me know what are the differences in these two methods.

Also, I am interested in knowing what are the parameters that need to be changed to use hierarchical softmax AND/OR negative sampling with respect to dm, DBOW, Skip-gram and CBOW?

P.s. my application is a recommendation system :)

1

1 Answers

19
votes

Skip-gram or CBOW are different ways to choose the input contexts for the neural-network. Skip-gram picks one nearby word, then supplies it as input to try to predict a target word; CBOW averages together a bunch of nearby words, then supplies that average as input to try to predict a target word.

DBOW is most similar to skip-gram, in that a single paragraph-vector for a whole text is used to predict individual target words, regardless of distance and without any averaging. It can mix well with simultaneous skip-gram training, where in addition to using the single paragraph-vector, individual nearby word-vectors are also used. The gensim option dbow_words=1 will add skip-gram training to a DBOW dm=0 training.

DM is most similar to CBOW: the paragraph-vector is averaged together with a number of surrounding words to try to predict a target word.

So in Word2Vec, you must choose between skip-gram (sg=1) and CBOW (sg=0) – they can't be mixed. In Doc2Vec, you must choose between DBOW (dm=0) and DM (dm=1) - they can't be mixed. But you can, when doing Doc2Vec DBOW, also add skip-gram word-training (with dbow_words=1).

The choice between hierarchical-softmax and negative-sampling is separate and independent of the above choices. It determines how target-word predictions are read from the neural-network.

With negative-sampling, every possible prediction is assigned a single output-node of the network. In order to improve what prediction a particular input context creates, it checks the output-nodes for the 'correct' word (of the current training example excerpt of the corpus), and for N other 'wrong' words (that don't match the current training example). It then nudges the network's internal weights and the input-vectors to make the 'correct' word output node activation a little stronger, and the N 'wrong' word output node activations a little weaker. (This is called a 'sparse' approach, because it avoids having to calculate every output node, which is very expensive in large vocabularies, instead just calculation N+1 nodes and ignoring the rest.)

You could set negative-sampling with 2 negative-examples with the parameter negative=2 (in Word2Vec or Doc2Vec, with any kind of input-context mode). The default mode, if no negative specified, is negative=5, following the default in the original Google word2vec.c code.

With hierarchical-softmax, instead of every preictable word having its own output node, some pattern of multiple output-node activations is interpreted to mean specific words. Which nodes should be closer to 1.0 or 0.0 in order to represent a word is matter of the word's encoding, which is calculated so that common words have short encodings (involving just a few nodes), while rare words will have longer encodings (involving more nodes). Again, this serves to save calculation time: to check if an input-context is driving just the right set of nodes to the right values to predict the 'correct' word (for the current training-example), just a few nodes need to be checked, and nudged, instead of the whole set.

You enable hierarchical-softmax in gensim with the argument hs=1. By default, it is not used.

You should generally disable negative-sampling, by supplying negative=0, if enabling hierarchical-softmax – typically one or the other will perform better for a given amount of CPU-time/RAM.

(However, following the architecture of the original Google word2vec.c code, it is possible but not recommended to have them both active at once, for example negative=5, hs=1. This will result in a larger, slower model, which might appear to perform better since you're giving it more RAM/time to train, but it's likely that giving equivalent RAM/time to just one or the other would be better.)

Hierarchical-softmax tends to get slower with larger vocabularies (because the average number of nodes involved in each training-example grows); negative-sampling does not (because it's always N+1 nodes). Projects with larger corpuses tend to trend towards preferring negative-sampling.