My model needs to run many epochs to get a decent result, and it takes a few hours using a V100 on Google Cloud.
Since I'm on a preemptible instance, it kicks me off in the middle of training. I would like to be able to resume from where it left off.
In my custom Callback, I call self.model.save(...) in on_epoch_end. It also stops training if the score hasn't improved in the last 50 epochs.
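For context, here's a minimal sketch of the kind of callback I mean (the class name, save path, and the 'val_acc' metric name are placeholders, not my exact code):

```python
from tensorflow import keras

class CheckpointWithEarlyStop(keras.callbacks.Callback):
    """Save the full model every epoch; stop after `patience`
    epochs with no improvement in the monitored score."""

    def __init__(self, save_path='model_epoch_{epoch}.h5', patience=50):
        super().__init__()
        self.save_path = save_path
        self.patience = patience
        self.best = -float('inf')
        self.wait = 0

    def on_epoch_end(self, epoch, logs=None):
        # model.save(...) writes architecture, weights, and optimizer
        # state, so the checkpoint should be fully resumable.
        self.model.save(self.save_path.format(epoch=epoch + 1))
        score = (logs or {}).get('val_acc')  # placeholder metric name
        if score is not None and score > self.best:
            self.best = score
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.model.stop_training = True
```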
Here are the steps I tried:
- I ran model.fit until early stopping kicked in after epoch 250 (the best score was at epoch 200).
- I loaded the model saved after the 100th epoch.
- I ran model.fit with initial_epoch=100. (It starts with Epoch 101; see the sketch after this list.)
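In code, the resume attempt looks roughly like this (the file name, data variables, and callback are placeholders for my actual setup):

```python
from tensorflow.keras.models import load_model

# Load the checkpoint written after epoch 100; load_model restores the
# optimizer state that model.save stored, not just the weights.
model = load_model('model_epoch_100.h5')  # placeholder file name

# initial_epoch=100 makes Keras label the next epoch "Epoch 101".
# Note that `epochs` is the absolute upper bound, not a count of
# additional epochs to run.
model.fit(x_train, y_train,                   # placeholder data
          validation_data=(x_val, y_val),
          epochs=1000,
          initial_epoch=100,
          callbacks=[my_callback])            # callback from above
```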
However, it takes a while to catch up with the first run, and the accuracy of each epoch gets close to the first run's but stays lower. Eventually early stopping kicked in around epoch 300, and the final score was lower than the first run's. The only way I can reproduce the first run's final score is to create the model from scratch and run fit from epoch 1.
I also tried reading float(K.get_value(self.model.optimizer.lr)) and writing K.set_value(self.model.optimizer.lr, new_lr). However, self.model.optimizer.lr always returned the same number. I assume that's because the Adam optimizer computes the effective learning rate from the initial lr I set with Adam(lr=1e-4).
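Concretely, this is the kind of probing I did; the bias-correction formula is my reading of the Keras Adam source, and the beta_1/beta_2 attribute access is an assumption that may differ between Keras versions:

```python
import numpy as np
from tensorflow.keras import backend as K

# Reading/writing the base rate: this always printed the value I
# passed to Adam(lr=1e-4), no matter how far training had progressed.
print(float(K.get_value(model.optimizer.lr)))
K.set_value(model.optimizer.lr, 5e-5)  # placeholder new_lr

# My assumption: optimizer.lr holds only the base rate, while the step
# Adam actually takes is bias-corrected by the iteration count t:
#     lr_t = lr * sqrt(1 - beta_2^t) / (1 - beta_1^t)
t = float(K.get_value(model.optimizer.iterations)) + 1.0
beta_1 = float(K.get_value(model.optimizer.beta_1))
beta_2 = float(K.get_value(model.optimizer.beta_2))
lr = float(K.get_value(model.optimizer.lr))
lr_t = lr * np.sqrt(1.0 - beta_2 ** t) / (1.0 - beta_1 ** t)
print(lr_t)  # effective step size at iteration t
```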
I'm wondering: what's the right approach to resuming training with the Adam optimizer?