
I'm trying to train multiple models in parallel on a single graphics card. To achieve that I need to resume training of models from saved weights, which is not a problem. The model.fit() method even has an initial_epoch parameter that lets me tell the model which epoch the loaded model is on. However, when I pass a TensorBoard callback to the fit() method in order to monitor the training of the models, all data is shown at x=0 on TensorBoard.

Is there a way to overcome this and adjust the epoch on TensorBoard?

By the way: I'm running Keras 2.0.6 and TensorFlow 1.3.0.

# Train for one epoch, logging to TensorBoard.
# (start_epoch is not a real TensorBoard parameter; see the comments below.)
self.callbacks = [TensorBoardCallback(log_dir='./../logs/' + self.model_name,
                                      histogram_freq=0, write_graph=True,
                                      write_images=False,
                                      start_epoch=self.step_num)]
self.model.fit(x=self.data['X_train'], y=self.data['y_train'],
               batch_size=self.input_params[-1]['batch_size'], epochs=1,
               validation_data=(self.data['X_test'], self.data['y_test']),
               verbose=verbose, callbacks=self.callbacks,
               shuffle=self.hyperparameters['shuffle_data'],
               initial_epoch=self.step_num)
self.model.save_weights('./weights/%s.hdf5' % self.model_name)

# Reload the weights and try to resume training from self.step_num.
self.model.load_weights('./weights/%s.hdf5' % self.model_name)
self.model.fit(x=self.data['X_train'], y=self.data['y_train'],
               batch_size=self.input_params[-1]['batch_size'], epochs=1,
               validation_data=(self.data['X_test'], self.data['y_test']),
               verbose=verbose, callbacks=self.callbacks,
               shuffle=self.hyperparameters['shuffle_data'],
               initial_epoch=self.step_num)
self.model.save_weights('./weights/%s.hdf5' % self.model_name)

The resulting graph on TensorBoard looks like this, which is not what I was hoping for: [screenshot: every model's data plotted as a single point at epoch 0]

Update:

When passing epochs=10 to the first model.fit(), the results for all 10 epochs are displayed in TensorBoard (see picture).

However, when reloading the model and running it (with the same callback attached), the callback's on_epoch_end method never gets called.

[screenshot: TensorBoard graph showing the first run's 10 epochs]
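For reference, here is a stripped-down repro of the symptom with a toy model (the model, data, and names below are illustrative, not from my actual project):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import LambdaCallback

# Tiny stand-in model and data, just to observe the callback behaviour.
model = Sequential([Dense(1, input_dim=4)])
model.compile(optimizer='sgd', loss='mse')
X, y = np.random.rand(8, 4), np.random.rand(8, 1)

# Prints the epoch index every time on_epoch_end fires.
epoch_logger = LambdaCallback(
    on_epoch_end=lambda epoch, logs: print('on_epoch_end called for epoch', epoch))

# First run: prints epochs 0..9.
model.fit(X, y, epochs=10, callbacks=[epoch_logger], verbose=0)

# "Resumed" run, mirroring my snippet above (epochs=1, initial_epoch=10):
# fit() has no epochs left to run, so nothing prints and nothing trains.
model.fit(X, y, epochs=1, initial_epoch=10, callbacks=[epoch_logger], verbose=0)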

Which version of Keras are you running? I'm looking at the master branch and the TensorBoard callback doesn't have a start_epoch parameter (instead, the epoch is passed to TensorBoard#on_epoch_end by the caller). - ldavid
I'm running Keras 2.0.6, but as mentioned in the post, the initial_epoch parameter is not passed to the callback but to Keras' model.fit() method. - Raspel
I know initial_epoch is fit's parameter, but the first line in your code snippet is self.callbacks = [TensorBoardCallback(... start_epoch=self.step_num)], which is weird. Another question about this parallel training: do these models have different names? Are there multiple directories inside logs? - ldavid
Oh yes, you are right, sorry. I tried to play a bit with the TensorBoard callback itself and tried to replace the epoch in on_epoch_end() with my epoch number, though that method doesn't seem to get called when I train a reloaded model. I did not mean to leave that in the code, sorry. Yes, every model has a different name and all of them are displayed on the TensorBoard graph, but only for epoch 0. - Raspel
Don't worry. Back to the problem: I am thinking the models were being recreated (therefore, being renamed) before being resumed. If you want them to reappear as the same line in the graph, they must have the same name => be contained in the same subfolder. Still, from the graph, you don't seem to have trained any model for even one epoch. Try to train them for more than one epoch and check if a line (and not only a point) appears in the graph; drag smoothing to 0 and check "step" instead of "relative" at "Horizontal axis". Then upload the new graph to your question. - ldavid

1 Answer


Turns out that the epochs argument I pass to model.fit() to tell it how long to train is not a count of new epochs; it is the epoch index up to which training runs, counted FROM the initial_epoch specified. So if initial_epoch=self.step_num, then epochs=self.step_num+10 if I want to train for 10 more epochs.
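A minimal, self-contained sketch of that resume pattern (the toy model, data, and paths are illustrative, not from the question's project):

import os
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import TensorBoard

os.makedirs('./weights', exist_ok=True)

# Toy model and data; names and paths are placeholders.
model = Sequential([Dense(1, input_dim=4)])
model.compile(optimizer='sgd', loss='mse')
X, y = np.random.rand(32, 4), np.random.rand(32, 1)

tb = TensorBoard(log_dir='./logs/demo')

# First run: trains epochs 0..9, logged at steps 0..9.
model.fit(X, y, epochs=10, callbacks=[tb], verbose=0)
model.save_weights('./weights/demo.hdf5')

# Resume: epochs is the TARGET epoch index, not a count of new epochs.
# fit() runs range(initial_epoch, epochs) = epochs 10..19, and TensorBoard
# logs them at steps 10..19, continuing the same curve.
model.load_weights('./weights/demo.hdf5')
model.fit(X, y, initial_epoch=10, epochs=20, callbacks=[tb], verbose=0)

In other words, read epochs as "train until the epoch index reaches this value", with initial_epoch setting where the count resumes.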