I'll try to give you some intuition about the problem...
Initially, updates were made using what you (correctly) call (Batch) Gradient Descent. This ensures that each update to the weights is made in the "right" direction (Fig. 1): the one that minimizes the cost function.
With the growth in dataset sizes and increasingly complex computations at each step, Stochastic Gradient Descent came to be preferred in these cases. Here, the weights are updated as each sample is processed, so subsequent calculations already use "improved" weights. Nonetheless, for this very reason it incurs some misdirection in minimizing the error function (Fig. 2).
As such, in many situations it is preferable to use Mini-batch Gradient Descent, which combines the best of both worlds: each update to the weights is done using a small batch of the data. This way, the direction of the updates is somewhat rectified in comparison with the stochastic updates, and the weights are updated much more regularly than in the case of the (original) Gradient Descent. (A small sketch of this variant is given at the end of this answer.)
[UPDATE] As requested, I present below the pseudocode for batch gradient descent in binary classification:
    error = 0
    for sample in data:
        prediction = neural_network.predict(sample)
        # the error measure may be as simple as abs(prediction - sample["label"])
        sample_error = evaluate_error(prediction, sample["label"])
        error += sample_error
    neural_network.backpropagate_and_update(error)
(In the case of multi-class labeling, error is an array holding the error for each label.)
This code is run for a given number of iterations, or while the error remains above a threshold. For stochastic gradient descent, neural_network.backpropagate_and_update() is instead called inside the for loop, with the sample error as its argument.
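To make that concrete, here is the same pseudocode rearranged for the stochastic variant, reusing the hypothetical neural_network and evaluate_error interface from above (these names are placeholders, not a real library API):

    # stochastic gradient descent: one weight update per sample
    for sample in data:
        prediction = neural_network.predict(sample)
        sample_error = evaluate_error(prediction, sample["label"])
        # update the weights immediately, using this sample's error alone
        neural_network.backpropagate_and_update(sample_error)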
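And, for completeness, a minimal sketch of the mini-batch variant described earlier, under the same hypothetical interface. It assumes data is an indexable list; the batch size of 32 is an illustrative choice to tune for your problem:

    # mini-batch gradient descent: one weight update per small batch
    batch_size = 32
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        batch_error = 0
        for sample in batch:
            prediction = neural_network.predict(sample)
            batch_error += evaluate_error(prediction, sample["label"])
        # update the weights once per batch, using the accumulated batch error
        neural_network.backpropagate_and_update(batch_error)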