0 votes

I am training a recurrent neural network on LibriSpeech. I have tried different variations of learning rate, batch size, etc., and in every run the validation loss saturates after about 7 epochs. I thought this might be due to overfitting. But I noticed a weird behaviour: if, after those 7 epochs of training, I reset the Adam optimizer, i.e. its slot variables m and v, the validation loss drops to a new, lower minimum and then oscillates around that value for the rest of the training. I am speculating that over longer periods of training the v slot variables become much smaller than the m slot variables, so resetting them triggers this unexpected behaviour, but I am not sure. So, do we need to reset the Adam optimizer after every fixed number of steps? And if not, why does the validation loss decrease to a new lower minimum? I am using the default values of beta_1, beta_2 and epsilon for the Adam optimizer in TensorFlow.
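For concreteness, resetting the slot variables could look roughly like the sketch below, assuming the TF 2.x Keras optimizer API (`model` is a placeholder for the RNN; the slots only exist once the optimizer has applied at least one gradient step):

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam()  # default beta_1, beta_2, epsilon

def reset_adam_slots(optimizer, model):
    """Zero Adam's first- and second-moment accumulators (slots 'm' and 'v')."""
    for var in model.trainable_variables:
        for slot_name in optimizer.get_slot_names():   # ['m', 'v'] for Adam
            optimizer.get_slot(var, slot_name).assign(tf.zeros_like(var))
    # Also reset the step counter so the bias correction starts over.
    optimizer.iterations.assign(0)
```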

This is probably not well suited as a question for SO, as it would require empirical research to validate that this is consistent behaviour and not just something you observed anecdotally. I can at least tell you that resetting the averaging window (Adam's m and v accumulators) is definitely not standard practice (at least to my knowledge). In your case, the reset might lead to a short-term increase in the effective learning rate because the momentum is cleared, which would explain your observations, including the oscillations at the end. – runDOSrun
Hi, yes, you are right, this hasn't been validated empirically; it is just what I am seeing with my current neural network, and I had never encountered such an anomaly with Adam before. Yes, the effective learning rate should increase after the momentum is reset. Thanks for the intuition here; it explains why the validation loss suddenly decreases. Also, does this mean that training my model with a more carefully tuned learning rate should get me the reduced validation loss without resetting the optimizer? – Tushar Vatsal

2 Answers

1 vote

Not sure what is creating the behaviour, but I believe you might avoid it by using an adjustable learning rate. The Keras callback ReduceLROnPlateau makes that easy to do (see the Keras documentation). Set it up to monitor the validation loss, and it will automatically lower the learning rate by a specified factor if the validation loss fails to decrease over a specified number (patience) of consecutive epochs. I use a factor of 0.6 and a patience value of 1. Give it a try, and hopefully your validation loss will reach a lower level without resetting the optimizer.
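A minimal sketch with those values (`model`, `train_ds` and `val_ds` are placeholders for your own RNN and LibriSpeech pipeline):

```python
import tensorflow as tf

# Lower the learning rate by a factor of 0.6 whenever the validation loss
# fails to improve for 1 consecutive epoch.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss",  # watch the validation loss
    factor=0.6,          # new_lr = old_lr * 0.6
    patience=1,          # epochs without improvement before reducing
    min_lr=1e-6,         # floor for the learning rate
    verbose=1,
)

# `model`, `train_ds` and `val_ds` are placeholders for your own setup.
model.fit(train_ds, validation_data=val_ds, epochs=50, callbacks=[reduce_lr])
```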

1 vote

Resetting the averaging window (Adam's m and v accumulators) is definitely not standard practice (at least to my knowledge). In your case, resetting it might lead to a short-term increase in the effective learning rate because the momentum is cleared, which would explain your two observations:

  • The network suddenly improves: the reset acts as a "kickstarter", i.e. the network is able to jump out of the previous local minimum. This indicates it was stuck there before, i.e. the learning rate was either too small (if the error curve was flat or decreasing very slowly) or too large (if it was oscillating around that point).

  • The oscillations at the end: the increased step size drops the network into a new local minimum, but the steps are now too large to settle at the bottom of it (while not large enough to escape it), so the loss oscillates around that value.

While you have made an interesting observation, it would require more research to validate that this is consistent behaviour and not just anecdotal. In general (and in your case), I would always recommend running a grid search over learning rate and batch size (remember to account for the random initial weights by repeating each configuration a few times) until you find a training curve that does not saturate or overfit too early, while also not wasting training resources on a very large number of very small gradient updates. Both early stopping and learning rate decay can help with this, even if you are using Adam. These are established and well-researched practices that will almost always work, given some patience for the tuning process.
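As a rough sketch of such a grid search (with a toy model and synthetic stand-in data, since I don't know your exact pipeline; replace them with your LibriSpeech features and your actual RNN):

```python
import itertools
import numpy as np
import tensorflow as tf

# Synthetic stand-ins so the sketch runs; replace with your LibriSpeech pipeline.
x_train = np.random.randn(64, 100, 40).astype("float32")   # e.g. 40-dim filterbanks
y_train = np.random.randint(0, 29, size=(64,))              # e.g. character labels
x_val = np.random.randn(16, 100, 40).astype("float32")
y_val = np.random.randint(0, 29, size=(16,))

def build_model(learning_rate):
    # Toy RNN; swap in your actual architecture.
    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(128, input_shape=(None, 40)),
        tf.keras.layers.Dense(29, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
                  loss="sparse_categorical_crossentropy")
    return model

learning_rates = [1e-4, 3e-4, 1e-3]
batch_sizes = [16, 32, 64]
repeats = 3   # repeat each configuration to average over random initial weights

results = {}
for lr, bs in itertools.product(learning_rates, batch_sizes):
    val_losses = []
    for seed in range(repeats):
        tf.random.set_seed(seed)          # different initialization per repeat
        model = build_model(lr)
        history = model.fit(
            x_train, y_train,
            validation_data=(x_val, y_val),
            batch_size=bs,
            epochs=20,
            callbacks=[tf.keras.callbacks.EarlyStopping(
                monitor="val_loss", patience=3, restore_best_weights=True)],
            verbose=0,
        )
        val_losses.append(min(history.history["val_loss"]))
    results[(lr, bs)] = float(np.mean(val_losses))

best = min(results, key=results.get)
print("best (learning rate, batch size):", best, "mean val loss:", results[best])
```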