
When using one of the adaptive optimizers (Adam, etc.) we expect the learning rate to change across successive mini-batches within an epoch. But I wonder how the learning rate changes between successive epochs: does it continue from where the previous epoch left off (the behaviour I expect), or is it re-initialized to the default?

(Of course, by "rate" I mean the whole set of variables a particular optimizer uses to determine the actual weight update with respect to the gradient.)

Also, what would happen to the rate if I run training for N epochs, stop, and then continue like this:

model.fit(data1_train_x, data1_train_y,
          initial_epoch=0,
          epochs=20,
          validation_split=0.1,
          batch_size=64,
          callbacks=[tensorboard])

model.fit(data2_train_x, data2_train_y,
          initial_epoch=20,
          epochs=40,
          validation_split=0.1,
          batch_size=64,
          callbacks=[tensorboard])

I think I'll create a callback to log the rate after each epoch and plot it, but before I do, maybe someone already has the answer.
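For reference, a minimal sketch of such a logging callback, assuming TensorFlow 2.x Keras (the class name LRLogger and the print format are my own, not an existing API):

    import tensorflow as tf

    class LRLogger(tf.keras.callbacks.Callback):
        """Record the optimizer's base learning rate and step count after each epoch."""

        def __init__(self):
            super().__init__()
            self.history = []

        def on_epoch_end(self, epoch, logs=None):
            opt = self.model.optimizer
            lr = opt.learning_rate
            # The learning rate may be a plain variable or a schedule object.
            if isinstance(lr, tf.keras.optimizers.schedules.LearningRateSchedule):
                lr = lr(opt.iterations)
            self.history.append((epoch, int(opt.iterations), float(lr)))
            print(f"epoch {epoch}: iteration {int(opt.iterations)}, lr {float(lr):.6g}")

Passing callbacks=[LRLogger(), tensorboard] to both fit() calls would show whether the logged values pick up where they left off.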

You can use established Keras callbacks to modify your learning rate based on the epoch number. You may be able to do the same with other optimizer hyperparameters as well, although I have not tried. - user1269942
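For instance, the built-in LearningRateScheduler callback is one such option; a hedged sketch (the step_decay function below is only an illustration):

    import tensorflow as tf

    def step_decay(epoch, lr):
        # Shrink the current learning rate by 10% every 5 epochs.
        return lr * 0.9 if epoch > 0 and epoch % 5 == 0 else lr

    lr_callback = tf.keras.callbacks.LearningRateScheduler(step_decay, verbose=1)
    # model.fit(..., callbacks=[lr_callback, tensorboard])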

1 Answer


Summary

The rate (and the rest of the optimizer's state) does not reset; it continues smoothly across epochs in both cases.

Detail

Any well-behaved learning-rate decay function depends on how long training has been running, counted in iterations since iteration 0, not on epoch boundaries.

Note: you can write your own decay function and make it as deranged as you wish. One such choice is

alpha = iteration_number

This will diverge before you get back with your coffee.

Some functions merely depend on the current state and a modifier, such as

if iteration_number % 5000 == 0:
    alpha *= 0.9

Another common form is a semi-exponential decay that depends on the number of remaining iterations.
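One possible shape for such a decay (an illustrative sketch, not a specific library's formula; alpha0, total_iterations and the constant k are assumed values):

    import math

    def semi_exponential_lr(alpha0, iteration, total_iterations, k=4.0):
        # Decay alpha0 exponentially as the remaining fraction of training shrinks.
        remaining = max(total_iterations - iteration, 0) / total_iterations
        return alpha0 * math.exp(-k * (1.0 - remaining))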

In any case, these do not reset at the start of every epoch. You can write one to reset, if you wish, but I don't recommend it.

Your two-stage example is no exception, because you've coded it properly: the second training segment starts where the previous one left off. The optimizer's internal state (step counter, moment estimates) lives on the compiled model's optimizer object, so it carries over between fit() calls as long as you don't recompile. The critical clue here is the initial_epoch parameter: you're telling the fitting function where to resume the learning-rate schedule, rather than resetting it to time zero.
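A quick way to confirm this (a sketch, assuming TF2 Keras and the variables from the question): the optimizer's iteration counter keeps climbing across the two fit() calls instead of restarting at zero.

    # After the first fit() call (20 epochs), the step counter is non-zero.
    print(int(model.optimizer.iterations))

    model.fit(data2_train_x, data2_train_y,
              initial_epoch=20,
              epochs=40,
              validation_split=0.1,
              batch_size=64,
              callbacks=[tensorboard])

    # The counter continues from its previous value rather than resetting to 0;
    # Adam's moment estimates are carried over in the same way.
    print(int(model.optimizer.iterations))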