My model needs to run many epochs to get a decent result, and it takes a few hours using a V100 on Google Cloud.
Since I'm on a preemptible instance, it kicks me off in the middle of training. I would like to be able to resume from where it left off.
In my custom Callback, I call self.model.save(...) in on_epoch_end. It also stops training if the score hasn't improved in the last 50 epochs.
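For context, here's a minimal sketch of the kind of callback I mean (the class name, save path, and the 'val_acc' metric name are placeholders, not my exact code):

```python
from tensorflow import keras

class CheckpointWithEarlyStop(keras.callbacks.Callback):
    """Save the full model every epoch; stop after `patience`
    epochs with no improvement in the monitored score."""

    def __init__(self, save_path='model_epoch_{epoch}.h5', patience=50):
        super().__init__()
        self.save_path = save_path
        self.patience = patience
        self.best = -float('inf')
        self.wait = 0

    def on_epoch_end(self, epoch, logs=None):
        # model.save(...) writes architecture, weights, and optimizer
        # state, so the checkpoint should be fully resumable.
        self.model.save(self.save_path.format(epoch=epoch + 1))
        score = (logs or {}).get('val_acc')  # placeholder metric name
        if score is not None and score > self.best:
            self.best = score
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.model.stop_training = True
```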
Here are the steps I tried:
- I ran model.fit until early stopping kicked in after epoch 250 (the best score was at epoch 200).
- I loaded the model saved after the 100th epoch.
- I ran model.fit with initial_epoch=100. (It starts with Epoch 101; see the sketch after this list.)
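In code, the resume attempt looks roughly like this (the file name, data variables, and callback are placeholders for my actual setup):

```python
from tensorflow.keras.models import load_model

# Load the checkpoint written after epoch 100; load_model restores the
# optimizer state that model.save stored, not just the weights.
model = load_model('model_epoch_100.h5')  # placeholder file name

# initial_epoch=100 makes Keras label the next epoch "Epoch 101".
# Note that `epochs` is the absolute upper bound, not a count of
# additional epochs to run.
model.fit(x_train, y_train,                   # placeholder data
          validation_data=(x_val, y_val),
          epochs=1000,
          initial_epoch=100,
          callbacks=[my_callback])            # callback from above
```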
However, it takes a while to catch up with the first run, and the accuracy of each epoch gets close to the first run's but stays lower. Eventually early stopping kicked in around epoch 300, and the final score was lower than the first run's. The only way I can reproduce the first run's final score is to create the model from scratch and run fit from epoch 1.
I also tried reading float(K.get_value(self.model.optimizer.lr)) and writing K.set_value(self.model.optimizer.lr, new_lr). However, self.model.optimizer.lr always returned the same number. I assume that's because the Adam optimizer computes the effective learning rate from the initial lr I set with Adam(lr=1e-4).
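Concretely, this is the kind of probing I did; the bias-correction formula is my reading of the Keras Adam source, and the beta_1/beta_2 attribute access is an assumption that may differ between Keras versions:

```python
import numpy as np
from tensorflow.keras import backend as K

# Reading/writing the base rate: this always printed the value I
# passed to Adam(lr=1e-4), no matter how far training had progressed.
print(float(K.get_value(model.optimizer.lr)))
K.set_value(model.optimizer.lr, 5e-5)  # placeholder new_lr

# My assumption: optimizer.lr holds only the base rate, while the step
# Adam actually takes is bias-corrected by the iteration count t:
#     lr_t = lr * sqrt(1 - beta_2^t) / (1 - beta_1^t)
t = float(K.get_value(model.optimizer.iterations)) + 1.0
beta_1 = float(K.get_value(model.optimizer.beta_1))
beta_2 = float(K.get_value(model.optimizer.beta_2))
lr = float(K.get_value(model.optimizer.lr))
lr_t = lr * np.sqrt(1.0 - beta_2 ** t) / (1.0 - beta_1 ** t)
print(lr_t)  # effective step size at iteration t
```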
I'm wondering: what's the right approach to resuming training with the Adam optimizer?