0
votes

In neural networks, (batch) Gradient Descent computes the gradient over the entire training set, and the cost function decreases over iterations. If the cost function increases, it is usually due to an implementation error or an inappropriate learning rate.

Conversely, Stochastic Gradient Descent computes the gradient from a single training example at a time. I'm wondering whether the cost function may increase from one sample to the next, even though the implementation is correct and the parameters are well tuned. My feeling is that occasional increases of the cost function are fine, since the gradient step minimizes the loss of a single sample, and that direction may not match the direction of convergence of the overall system.

Are increases of the cost function expected in Stochastic Gradient Descent?
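
For reference, here is a minimal toy example of what I mean (my own sketch in plain NumPy, not from any particular framework): single-sample SGD on a small linear-regression problem, logging the loss over the whole training set after every update.

```python
# Minimal sketch: plain SGD on a toy linear regression, updating on one
# sample at a time and logging the loss over the *entire* training set
# after every single-sample step.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # 100 samples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)    # noisy targets

w = np.zeros(3)
lr = 0.01

def full_loss(w):
    """Mean squared error over the whole training set."""
    return np.mean((X @ w - y) ** 2)

losses = []
for epoch in range(5):
    for i in rng.permutation(len(X)):
        # Gradient of the squared error of a single sample i.
        grad = 2 * (X[i] @ w - y[i]) * X[i]
        w -= lr * grad
        losses.append(full_loss(w))

# Even with a correct implementation and a reasonable learning rate,
# consecutive entries in `losses` are not monotonically decreasing.
increases = sum(b > a for a, b in zip(losses, losses[1:]))
print(f"{increases} of {len(losses) - 1} steps increased the full-dataset loss")
```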

3
I think many people nowadays call it Stochastic Gradient (as it's not a strict descent method). - sascha

3 Answers

1
votes

In theory we are taught that the cost decreases over time if the model is neither overfitting nor underfitting. In practice, however, that is not entirely true. In a real-world optimization problem you will notice that the cost function is actually very noisy: it has a lot of peaks, and the underlying decreasing trend is hard to see. To see the trend, compute a moving average of the loss; the signal becomes cleaner and you can tell whether the cost function is decreasing or increasing. Hope this helps.
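
For example, a small sketch (my own, assuming the raw per-step losses have already been collected in a list or array) of smoothing the curve with a simple moving average:

```python
# Smooth a noisy loss curve with a simple moving average to expose the trend.
import numpy as np

def moving_average(losses, window=100):
    """Simple moving average of the loss curve with the given window size."""
    losses = np.asarray(losses, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(losses, kernel, mode="valid")

# Example: a noisy but slowly decreasing loss curve.
rng = np.random.default_rng(0)
raw = np.exp(-np.linspace(0, 3, 2000)) + 0.2 * rng.normal(size=2000)
smooth = moving_average(raw, window=100)
print(raw[:5].round(3), smooth[:5].round(3))   # smoothed values reveal the downward trend
```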

0
votes
  • A noisy loss curve is often a consequence of Stochastic Gradient Descent.

  • Try Mini-batch Gradient Descent with a reasonably large batch size. The loss plot smooths out because the gradients averaged over the samples in a batch tend to point in a better direction in weight space (see the sketch after this list).
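
As a rough illustration of the second point, here is a hedged sketch (my own toy linear-regression example, not a real network) of mini-batch gradient descent where the gradient is averaged over each batch before the update:

```python
# Mini-batch gradient descent on a toy linear model: the gradient is averaged
# over a batch of samples before each update, which smooths the loss curve
# compared to single-sample SGD.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
lr, batch_size = 0.05, 64

for epoch in range(10):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        # Average gradient over the mini-batch: individual noisy directions
        # tend to cancel, so the step points closer to the full-batch gradient.
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)
        w -= lr * grad
    print(f"epoch {epoch}: loss = {np.mean((X @ w - y) ** 2):.4f}")
```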

0
votes

Stochastic Gradient Descent iterates over batches of training data, calculating the error gradient at the output node(s) and back-propagating those errors through the network with a learning rate < 1. The gradient comes from a partial error function computed only over the batch, not the entire training set. The step in weight space will reduce the batch loss (and is guaranteed to do so if the learning rate is sufficiently small), but that doesn't mean it will reduce the loss over the entire training set. There is no guarantee that a single step in weight space improves the aggregate loss across the full training set - this is entirely data-dependent.

It is absolutely possible that a single step in weight space improves the batch loss at the expense of the total error (effectively over-fitting to a subset of the data), but when this is repeated over all of the batches, training tends to move in the right direction with regard to the aggregate error. This depends on the learning rate though: if the learning rate is too high, the network may keep "bouncing around" in the loss landscape without converging; if it is too low, convergence may be very slow.
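
As an illustration (my own toy example, a small linear-regression problem rather than a real network), the same single-sample SGD loop run with a large, a moderate, and a tiny learning rate:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = X @ np.array([1.0, -2.0]) + 0.5 * rng.normal(size=500)

def sgd_losses(lr, epochs=10):
    """Full-dataset MSE recorded after each epoch of single-sample SGD."""
    w = np.zeros(2)
    history = []
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            w -= lr * 2 * (X[i] @ w - y[i]) * X[i]
        history.append(np.mean((X @ w - y) ** 2))
    return np.round(history, 3)

# Too large: the loss keeps fluctuating around a high plateau; too small: it
# decreases very slowly; a moderate value settles near the noise floor (~0.25 here).
for lr in (0.2, 0.02, 1e-4):
    print(f"lr={lr:g}: {sgd_losses(lr)}")
```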

(It's recommended to use an adaptive optimizer, e.g. Adam, which adjusts per-parameter learning rates dynamically to manage this trade-off for you.)
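
A minimal sketch of that, assuming PyTorch is available (the toy model and data are made up for illustration; torch.optim.Adam is the stock optimizer):

```python
# Adam adapts a per-parameter step size from running estimates of the
# gradient's first and second moments, which reduces the need to hand-tune
# the learning rate.
import torch
import torch.nn as nn

model = nn.Linear(3, 1)                      # toy model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

X = torch.randn(256, 3)
y = X @ torch.tensor([[2.0], [-1.0], [0.5]]) + 0.1 * torch.randn(256, 1)

for step in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()                         # Adam update with adaptive step sizes
print(f"final loss: {loss.item():.4f}")
```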