To my understanding, batch (vanilla) gradient descent makes one parameter update using all of the training data. Stochastic gradient descent (SGD) updates the parameters for each individual training sample, helping the model converge faster, at the cost of high fluctuation in the loss.
Batch (vanilla) gradient descent sets `batch_size=corpus_size`.

SGD sets `batch_size=1`.

And mini-batch gradient descent sets `batch_size=k`, in which `k` is usually 32, 64, 128...
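To make sure I have the distinction right, here is a minimal NumPy sketch of what I mean (the toy regression, names, and learning rate are just illustrative, not anything from gensim):

```python
import numpy as np

# Toy linear regression, y ~ w * x, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = 3.0 * X + rng.normal(scale=0.1, size=100)

def train(batch_size, lr=0.01, epochs=5):
    w = 0.0
    n = len(X)
    for _ in range(epochs):
        # One parameter update per batch:
        #   batch_size = n  -> batch (vanilla) gradient descent
        #   batch_size = 1  -> SGD
        #   batch_size = k  -> mini-batch gradient descent
        for start in range(0, n, batch_size):
            xb = X[start:start + batch_size]
            yb = y[start:start + batch_size]
            grad = np.mean(2 * (w * xb - yb) * xb)  # d/dw of mean squared error
            w -= lr * grad
    return w
```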
How does gensim apply SGD or mini-batch gradient descent? It seems that `batch_words` is the equivalent of `batch_size`, but I want to be sure.
Is setting `batch_words=1` in a gensim model equivalent to applying SGD?
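In other words, is the following (a minimal sketch using gensim 4.x parameter names; the toy sentences are just placeholders) what per-sample SGD would look like, assuming `batch_words` really does play the role of `batch_size`?

```python
from gensim.models import Word2Vec

sentences = [["hello", "world"], ["machine", "learning", "is", "fun"]]

# batch_words controls how many words are grouped into each job
# handed to the worker threads; whether batch_words=1 actually
# reproduces per-sample (SGD-style) updates is what I'd like to confirm.
model = Word2Vec(
    sentences,
    vector_size=100,
    window=5,
    min_count=1,
    workers=1,
    batch_words=1,
)
```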