To my understanding, batch (vanilla) gradient descent makes one parameter update using all of the training data. Stochastic gradient descent (SGD) updates the parameters for each individual training sample, helping the model converge faster, at the cost of high fluctuation in the loss.
Batch (vanilla) gradient descent sets `batch_size=corpus_size`.

SGD sets `batch_size=1`.

And mini-batch gradient descent sets `batch_size=k`, in which `k` is usually 32, 64, 128...
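To make sure I have the distinction right, here is a minimal NumPy sketch of what I mean (the toy regression, names, and learning rate are just illustrative, not anything from gensim):

```python
import numpy as np

# Toy linear regression, y ~ w * x, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = 3.0 * X + rng.normal(scale=0.1, size=100)

def train(batch_size, lr=0.01, epochs=5):
    w = 0.0
    n = len(X)
    for _ in range(epochs):
        # One parameter update per batch:
        #   batch_size = n  -> batch (vanilla) gradient descent
        #   batch_size = 1  -> SGD
        #   batch_size = k  -> mini-batch gradient descent
        for start in range(0, n, batch_size):
            xb = X[start:start + batch_size]
            yb = y[start:start + batch_size]
            grad = np.mean(2 * (w * xb - yb) * xb)  # d/dw of mean squared error
            w -= lr * grad
    return w
```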
How does gensim apply SGD or mini-batch gradient descent? It seems that `batch_words` is the equivalent of `batch_size`, but I want to be sure.
Is setting `batch_words=1` in a gensim model equivalent to applying SGD?
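In other words, is the following (a minimal sketch using gensim 4.x parameter names; the toy sentences are just placeholders) what per-sample SGD would look like, assuming `batch_words` really does play the role of `batch_size`?

```python
from gensim.models import Word2Vec

sentences = [["hello", "world"], ["machine", "learning", "is", "fun"]]

# batch_words controls how many words are grouped into each job
# handed to the worker threads; whether batch_words=1 actually
# reproduces per-sample (SGD-style) updates is what I'd like to confirm.
model = Word2Vec(
    sentences,
    vector_size=100,
    window=5,
    min_count=1,
    workers=1,
    batch_words=1,
)
```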