
I have been using the following piece of code to print the lr_t learning rate of the Adam() optimizer for my trainable_model.

# required imports (the snippet itself lives inside a class method, hence `self`)
import numpy as np
import tensorflow as tf
from keras import backend as K

# print roughly 3% of the time while training
if np.random.uniform() * 100 < 3 and self.training:
    model = self.trainable_model
    _lr    = tf.to_float(model.optimizer.lr, name='ToFloat')
    _decay = tf.to_float(model.optimizer.decay, name='ToFloat')
    _beta1 = tf.to_float(model.optimizer.beta_1, name='ToFloat')
    _beta2 = tf.to_float(model.optimizer.beta_2, name='ToFloat')
    _iterations = tf.to_float(model.optimizer.iterations, name='ToFloat')
    t = K.cast(_iterations, K.floatx()) + 1
    # bias-correction factor applied to the base learning rate
    _lr_t = _lr * (K.sqrt(1. - K.pow(_beta2, t)) / (1. - K.pow(_beta1, t)))
    print(" - LR_T: " + str(K.eval(_lr_t)))

What I don't understand is that this learning rate increases (with decay at its default value of 0).

If we look at the learning_rate equation in Adam, we find this:

 lr_t = lr * (K.sqrt(1. - K.pow(self.beta_2, t)) /
                 (1. - K.pow(self.beta_1, t)))

which corresponds to the following expression (with the default parameter values lr = 0.001, beta_1 = 0.9, beta_2 = 0.999):

lr_t = 0.001 * sqrt(1 - 0.999^t) / (1 - 0.9^t)

If we plot this expression we obtain: [plot of lr_t against t]

which shows that the learning rate increases over time (since t starts at 1).
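The same factor can be checked numerically; below is a minimal NumPy sketch using Adam's default beta_1 = 0.9 and beta_2 = 0.999 (the step counts are arbitrary, chosen only to show the trend):

import numpy as np

lr, beta_1, beta_2 = 0.001, 0.9, 0.999  # Keras Adam defaults

def lr_t(t):
    # the bias-correction factor from the Keras source, times the base lr
    return lr * np.sqrt(1.0 - beta_2 ** t) / (1.0 - beta_1 ** t)

for t in [1, 10, 100, 1000, 10000]:
    print(t, lr_t(t))
# for large t the factor approaches 1, so lr_t climbs towards the base lr = 0.001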

Can someone explain why this is the case? I have read everywhere that we should use a learning rate that decays over time, not one that increases.

Does it mean that my neural network makes bigger updates over time as Adam's learning rate increases?

These equations for the learning rate are incomplete; you are not considering the division by the running mean of the squared gradient. – Dr. Snoopy
Do you mean that after this division the actual learning rate may in fact be decreasing? – Zhell
No, I mean that your equations are incorrect, so you are drawing incorrect conclusions. – Dr. Snoopy
This equation is straight from Keras, so I don't think it is incorrect, though maybe it is incomplete for what you are talking about. My "conclusion" is that the learning rate increases; if that is incorrect, it would imply that the learning rate decreases, yet you tell me that this is not what you mean, so I don't follow. Can you explain a bit more, please? – Zhell

1 Answer


Looking at the source code of the Adam optimizer in Keras, the actual "decay" is performed at this line. The code you reported is executed only afterwards and is not the decay itself.
If the question is "why is it like that?", I would suggest reading some theory about Adam, such as the original paper.
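To illustrate what that line does, here is a rough sketch of the time-based decay as it is applied to the base learning rate before the code you quoted runs (a paraphrase of the Keras 2.x logic, not the verbatim source; the function name is mine):

def decayed_lr(base_lr, decay, iterations):
    # Keras-style time-based decay: shrink the base lr as iterations grow
    if decay > 0:
        return base_lr * (1.0 / (1.0 + decay * iterations))
    return base_lr

print(decayed_lr(0.001, 0.0, 100))   # decay = 0 (your case): the base lr never changes
print(decayed_lr(0.001, 1e-4, 100))  # with decay > 0 the lr shrinks over time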

EDIT
It should be clear that the update equation of the Adam optimizer does NOT include decay by itself; decay has to be applied separately.
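For completeness, here is a minimal NumPy sketch of the full per-parameter Adam step (following the original paper, with the usual default hyperparameters); it makes explicit that the effective step size also divides by the root of the running average of squared gradients, which is the part missing from the equations in the question:

import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta_1=0.9, beta_2=0.999, eps=1e-8):
    # running averages of the gradient and of the squared gradient
    m = beta_1 * m + (1.0 - beta_1) * grad
    v = beta_2 * v + (1.0 - beta_2) * grad ** 2
    # bias-corrected learning rate: this is the lr_t from the question
    lr_t = lr * np.sqrt(1.0 - beta_2 ** t) / (1.0 - beta_1 ** t)
    # the actual update also divides by sqrt(v), so lr_t alone does not
    # determine how big the step is
    param = param - lr_t * m / (np.sqrt(v) + eps)
    return param, m, v

Starting from m = v = 0 and calling this once per step with t = 1, 2, ... reproduces the standard Adam update.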