I'm confused about how the Adam optimizer actually works in TensorFlow.
The way I read the docs, the learning rate is changed on every gradient descent iteration.
But when I call the function I give it a learning rate. And I don't call it once to do, say, one epoch (implicitly running however many iterations it takes to go through my training data); I call it explicitly for each batch, like this:
for epoch in epochs:
    for batch in data:
        sess.run(train_adam_step, feed_dict={eta: 1e-3})
So my eta cannot be changing, and I'm not passing a time variable in. Or is this some sort of generator-type thing where, upon session creation, t is incremented each time I call the optimizer?
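
To make concrete what I mean by t, here is a rough NumPy sketch of a single Adam step as written in the Kingma & Ba paper (the names m, v, t are mine for illustration, not anything from TensorFlow's internals):

    import numpy as np

    # One Adam step, following the update rule in the paper.
    # t is the step counter in question; all names here are illustrative.
    def adam_step(param, grad, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        t += 1
        m = beta1 * m + (1 - beta1) * grad        # first-moment (mean) estimate
        v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
        m_hat = m / (1 - beta1 ** t)              # bias corrections depend on t
        v_hat = v / (1 - beta2 ** t)
        param = param - eta * m_hat / (np.sqrt(v_hat) + eps)
        return param, m, v, t

Even with eta fixed, the bias-correction terms depend on t, so something has to be counting steps somewhere; my question is whether TensorFlow does that for me behind the sess.run calls.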
Assuming it is some generator-type thing and the learning rate is being invisibly reduced: how could I run the Adam optimizer without decaying the learning rate? It seems to me like RMSProp is basically the same; the only thing I'd have to do to make it equivalent (learning rate aside) is change the hyperparameters momentum and decay to match beta1 and beta2 respectively. Is that correct?
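
Concretely, using the TF 1.x tf.train API with placeholder values, this is the swap I have in mind (I'm asking whether the two would behave the same, learning rate aside):

    import tensorflow as tf

    eta = 1e-3

    # Adam with explicit beta1/beta2
    adam = tf.train.AdamOptimizer(learning_rate=eta, beta1=0.9, beta2=0.999)

    # RMSProp with momentum matched to beta1 and decay matched to beta2 --
    # is this equivalent to Adam apart from how the learning rate is handled?
    rmsprop = tf.train.RMSPropOptimizer(learning_rate=eta, momentum=0.9, decay=0.999)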