Noisy training loss

Question

I am training encoder-decoder attention-based model, with batch size 8. I don't suspect too much noise in the dataset, however the examples come from a few different distributions.

I can see a lot of noise in the train loss curve. After averaging (.99), the tendency is fine. Also the accuracy of the model is not bad.

I'd like to understand what could be the reason of such shape of loss curve

The batch size is really small, try using 32 samples. The less samples in the batch size, the more importance is given to single samples, the more strong is the effect of outliers. — Daniele Grattarola
This is encoder-decoder attention-based model, so every example is in fact very complex example, with long sequence as input and different kind and length output. Bigger batch size doesn't fit top GPUs, but thank you — DavidS1992

DavidS1992 DavidS1992 · Accepted Answer · 2018-03-07T09:42:38

I found the answer myself.

I think other answers are not correct, because they are based on a experience with a simpler models/architectures. The main point that was bothering me was the fact that noise in losses usually is more symmetrical (you can plot the average and the noise is randomly over and below the average). Here, we see more like low-tendency path and sudden peaks.

As I wrote, the architecture I'm using is encoder-decoder with attention. It can easily concluded that inputs and outputs can have different lengths. The loss is summed over all time-steps, and DOESN'T need to be divided by the number of time-steps.

https://www.tensorflow.org/tutorials/seq2seq

Important note: It's worth pointing out that we divide the loss by batch_size, so our hyperparameters are "invariant" to batch_size. Some people divide the loss by (batch_size * num_time_steps), which plays down the errors made on short sentences. More subtly, our hyperparameters (applied to the former way) can't be used for the latter way. For example, if both approaches use SGD with a learning of 1.0, the latter approach effectively uses a much smaller learning rate of 1 / num_time_steps.

I was not averaging the loss, that's why the noise is observable.

P.S. Similarly the batch size of for example 8 can have a few hundred of inputs and targets so in fact you can't say that it is small or big not knowing the mean length of example.

Noisy training loss

3 Answers