I'm working on some convolutional neural network stuff and I've been reading up on the difference between these three, and I'm having some trouble. I'm looking at this website: http://sebastianruder.com/optimizing-gradient-descent/.
In it the author says batch gradient descent computes the gradient of the cost function with respect to the weights for the entire dataset. I'm confused as to how the entire training dataset gets applied. To me, stochastic gradient descent makes intuitive sense: I put a single image into the model, get a prediction, evaluate the cost function, and then optimize. How does having more than one sample apply to the cost function for mini-batch and batch gradient descent?
Thanks
SGD: take one image, obtain one gradient, and modify your vector. Then take another image and do the same, but this time the new gradient already sees the modified vector (after image 0).
Batch gradient descent: take image 0 and obtain its gradient; take image 1 and obtain its gradient; ... accumulate all of these and modify your vector once (the modified learning variables are only seen in the next epoch; every image computes its gradient with respect to the starting vector). - sascha
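Put differently, the batch (and mini-batch) cost is just the average of the per-sample costs, so its gradient is the average of the per-sample gradients; the only thing that changes between the three variants is how many samples you average over before applying a single update. Below is a minimal NumPy sketch of the three update loops. It uses a made-up linear model with squared-error loss instead of a CNN, and the data, learning rate, and batch size are arbitrary illustration values, not anything from the linked article:

```python
import numpy as np

# Toy data: 100 "images" with 5 features each, and a linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)

def gradient(w, X_part, y_part):
    """Gradient of the mean squared error w.r.t. w over the given samples."""
    residual = X_part @ w - y_part
    return 2.0 * X_part.T @ residual / len(y_part)

lr = 0.01

# Batch gradient descent: one update per epoch,
# gradient averaged over ALL samples with the same starting vector.
w = np.zeros(5)
for epoch in range(100):
    w -= lr * gradient(w, X, y)

# Stochastic gradient descent: one update per sample;
# each gradient already sees the vector modified by the previous samples.
w = np.zeros(5)
for epoch in range(100):
    for i in rng.permutation(len(y)):
        w -= lr * gradient(w, X[i:i+1], y[i:i+1])

# Mini-batch gradient descent: one update per small group of samples
# (here 10 at a time), averaging the gradient within each group.
w = np.zeros(5)
batch_size = 10
for epoch in range(100):
    for start in range(0, len(y), batch_size):
        w -= lr * gradient(w, X[start:start+batch_size], y[start:start+batch_size])
```

The only structural difference is how much data each call to `gradient` sees before `w` is updated: everything at once (batch), one sample (SGD), or a small slice (mini-batch).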