I am training a recurrent neural network on LibriSpeech. I have tried different variations of learning rate, batch size, etc. In every run, one thing was the same: the validation loss saturates after 7 epochs. I thought this might be due to overfitting. But I noticed a strange behaviour: if I reset the Adam optimizer, i.e. its slot variables m and v, after those 7 epochs of training, the validation loss decreases to a new, lower minimum and then oscillates around that value for the rest of the training. My speculation is that over long periods of training the v slot variables become much smaller than the m slot variables, so resetting them produces this effect, but I am not sure.

So, do we need to reset the Adam optimizer after every fixed number of steps? And if not, why does the validation loss decrease to a new, lower minimum? I am using the default values of beta_1, beta_2 and epsilon for the Adam optimizer in TensorFlow.
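To make it concrete, the reset I am describing is roughly the following (a simplified sketch; `reset_adam_slots` is just a name I am using here, and it assumes an optimizer API that exposes `get_slot()`, e.g. `tf.keras.optimizers.Adam` in TF 2.x up to 2.10 or `tf.keras.optimizers.legacy.Adam` afterwards):

```python
import tensorflow as tf

def reset_adam_slots(optimizer, model):
    """Zero out Adam's m and v slot variables, leaving the model weights
    and the optimizer's iteration counter untouched."""
    for var in model.trainable_variables:
        for slot_name in ("m", "v"):
            slot = optimizer.get_slot(var, slot_name)
            slot.assign(tf.zeros_like(slot))

# called once after epoch 7, e.g. from an on_epoch_end callback:
# reset_adam_slots(model.optimizer, model)
```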
2 Answers
Not sure what is creating the behaviour, but I believe you might avoid it by using an adjustable learning rate. The Keras callback ReduceLROnPlateau makes that easy to do; documentation is here. Set it up to monitor the validation loss and it will automatically lower the learning rate by a specified factor if the validation loss fails to decrease over a specified number (patience) of consecutive epochs. I use a factor of 0.6 and a patience of 1. Give it a try and hopefully your validation loss will reach a lower level without resetting the optimizer.
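A minimal setup might look like this (`model`, `train_dataset` and `val_dataset` are placeholders for your own training pipeline):

```python
from tensorflow.keras.callbacks import ReduceLROnPlateau

# Lower the learning rate by a factor of 0.6 whenever val_loss fails to
# improve for 1 consecutive epoch.
reduce_lr = ReduceLROnPlateau(
    monitor="val_loss",
    factor=0.6,
    patience=1,
    min_lr=1e-6,   # optional floor so the learning rate cannot shrink forever
    verbose=1,
)

# `model`, `train_dataset` and `val_dataset` are placeholders here.
model.fit(
    train_dataset,
    validation_data=val_dataset,
    epochs=50,
    callbacks=[reduce_lr],
)
```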
Resetting the optimizer's moving averages is definitely not standard practice (at least to my knowledge). In your case, resetting them might lead to a short-term increase in the effective learning rate due to the reset momentum, which would explain your two observations (see the numerical sketch after this list):
The network suddenly improves: the reset acts as a "kickstarter", i.e. the network is able to jump out of the previous local minimum. This indicates that it was previously stuck there, i.e. the learning rate was either too small (if the error curve was flat or decreasing very slowly) or too large (if it oscillated around that point).
The oscillations at the end: the increased step size carries the network into a new local minimum, but the steps are still too large to settle at its bottom (while not large enough to escape it again), so the loss oscillates around the new value.
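To see why a reset can temporarily boost the step size, here is a small numerical sketch (not your experiment, just Adam's per-parameter update rule applied to noisy, near-zero gradients with the default hyperparameters; bias correction is omitted because it is ≈ 1 once the iteration count is large):

```python
import numpy as np

# Toy illustration of Adam's per-parameter step size on a plateau:
# small, mostly-noise gradients, default beta_1/beta_2, bias correction omitted.
rng = np.random.default_rng(0)
beta1, beta2, eps, lr = 0.9, 0.999, 1e-7, 1e-3

m, v = 0.0, 0.0
for t in range(20000):                         # long "plateau" phase
    g = rng.normal(0.0, 1e-2)                  # zero-mean, noisy gradient
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
step_before = lr * abs(m) / (np.sqrt(v) + eps)

m, v = 0.0, 0.0                                # "reset" the m and v slots
g = rng.normal(0.0, 1e-2)
m = beta1 * m + (1 - beta1) * g
v = beta2 * v + (1 - beta2) * g * g
step_after = lr * abs(m) / (np.sqrt(v) + eps)  # ≈ lr * (1-beta1)/sqrt(1-beta2) ≈ 3 * lr

print(f"step before reset:      {step_before:.2e}")  # typically well below lr
print(f"step right after reset: {step_after:.2e}")   # a few times lr
```

The point is that right after a reset the very first updates have a magnitude on the order of the learning rate (or a few times it), regardless of how small the gradients are, which is exactly the kind of temporary "kick" described above.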
While you have made an interesting observation, it would require more research to validate that this is consistent behaviour and not just anecdotal. In general (and in your case), I would always recommend running a grid search over learning rate and batch size (remember to account for the random initial weights by repeating each configuration a couple of times) until you find a training curve that neither saturates nor overfits too early, while also not wasting too many training resources on a very large number of very small gradient updates. Both early stopping and learning rate decay can help with this, even if you are using Adam. These are established and well-researched practices that will almost always work, given you have some patience for the tuning process.
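A rough sketch of what such a search could look like (`build_model()`, `loss_fn` and the dataset names are placeholders for your own RNN, loss and LibriSpeech pipeline; the grid values are only examples):

```python
import itertools
import tensorflow as tf

# Grid search over learning rate and batch size, with early stopping.
# `build_model()`, `loss_fn`, `train_dataset` and `val_dataset` are placeholders.
learning_rates = [1e-3, 5e-4, 1e-4]
batch_sizes = [32, 64, 128]
n_repeats = 3  # repeat each configuration to average over random initialisations

results = {}
for lr, bs in itertools.product(learning_rates, batch_sizes):
    val_losses = []
    for _ in range(n_repeats):
        model = build_model()  # placeholder: returns a fresh, uncompiled model
        model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                      loss=loss_fn)  # placeholder loss, e.g. your CTC loss
        history = model.fit(
            train_dataset.batch(bs),
            validation_data=val_dataset.batch(bs),
            epochs=30,
            callbacks=[tf.keras.callbacks.EarlyStopping(
                monitor="val_loss", patience=3, restore_best_weights=True)],
            verbose=0,
        )
        val_losses.append(min(history.history["val_loss"]))
    results[(lr, bs)] = sum(val_losses) / len(val_losses)

best = min(results, key=results.get)
print("best (learning_rate, batch_size):", best, "mean val_loss:", results[best])
```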