I'm working on some convolutional neural network stuff and I've been reading up on the difference between these three, and I'm having some trouble. I'm looking at this website: http://sebastianruder.com/optimizing-gradient-descent/.
In it the author says batch gradient descent computes the gradient of the cost function with respect to the weights for the entire dataset. I'm confused as to how the entire training dataset gets applied. To me, stochastic gradient descent makes intuitive sense: I put a single image into the model, get a prediction, evaluate the cost function, and then optimize. How does having more than one sample apply to the cost function for mini-batch and batch gradient descent?
Thanks
SGD: take one image, obtain one gradient, and modify your vector. Then take another image and do the same, but this time the new gradient already sees the modified vector (after image 0).
Batch gradient descent: take image 0 and obtain its gradient; take image 1 and obtain its gradient; ... accumulate all of these and modify your vector once (the modified learning variables are only seen in the next epoch; every image computes its gradient with respect to the starting vector). - sascha
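Put differently, the batch (and mini-batch) cost is just the average of the per-sample costs, so its gradient is the average of the per-sample gradients; the only thing that changes between the three variants is how many samples you average over before applying a single update. Below is a minimal NumPy sketch of the three update loops. It uses a made-up linear model with squared-error loss instead of a CNN, and the data, learning rate, and batch size are arbitrary illustration values, not anything from the linked article:

```python
import numpy as np

# Toy data: 100 "images" with 5 features each, and a linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)

def gradient(w, X_part, y_part):
    """Gradient of the mean squared error w.r.t. w over the given samples."""
    residual = X_part @ w - y_part
    return 2.0 * X_part.T @ residual / len(y_part)

lr = 0.01

# Batch gradient descent: one update per epoch,
# gradient averaged over ALL samples with the same starting vector.
w = np.zeros(5)
for epoch in range(100):
    w -= lr * gradient(w, X, y)

# Stochastic gradient descent: one update per sample;
# each gradient already sees the vector modified by the previous samples.
w = np.zeros(5)
for epoch in range(100):
    for i in rng.permutation(len(y)):
        w -= lr * gradient(w, X[i:i+1], y[i:i+1])

# Mini-batch gradient descent: one update per small group of samples
# (here 10 at a time), averaging the gradient within each group.
w = np.zeros(5)
batch_size = 10
for epoch in range(100):
    for start in range(0, len(y), batch_size):
        w -= lr * gradient(w, X[start:start+batch_size], y[start:start+batch_size])
```

The only structural difference is how much data each call to `gradient` sees before `w` is updated: everything at once (batch), one sample (SGD), or a small slice (mini-batch).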