3 votes

I am writing neural network code in TensorFlow, and I made it save variables every 1000 epochs. So I expect the variables of the 1001st, 2001st, 3001st, ... epochs to be saved to different files. The code below is the save function I wrote.

import os

def save(self, epoch):
    model_name = "MODEL_save"
    checkpoint_dir = os.path.join(model_name)

    if not os.path.exists(checkpoint_dir):
        os.makedirs(checkpoint_dir)
    # Numbered checkpoint (e.g. model-1001) to keep a history of variables.
    self.saver.save(self.sess, checkpoint_dir + '/model', global_step=epoch)
    # Un-numbered checkpoint (model) holding only the latest variables.
    self.saver.save(self.sess, checkpoint_dir + '/model')
    print("path for saved %s" % checkpoint_dir)

I made this code save twice each time the function is called: I wanted a history of the variables every 1000 epochs via global_step=epoch, and I also wanted the latest variables in a file without the epoch in its name. I call this function whenever the epoch condition is met, like below.

for epoch in xrange(self.m_total_epoch):

    .... CODE FOR NEURAL NETWORK ....

    # Save on epochs 1001, 2001, 3001, ... (epoch 1 itself is skipped).
    if epoch % 1000 == 1 and epoch != 1:
        self.save(epoch)

Assuming the current epoch is 29326, I expect the directory to contain all the saved files: 1001, 2001, 3001, ... 29001. However, only some of them are there: 26001, 27001, 28001, and 29001. I checked and the same thing happens on other computers. This is different from what I expected. Why does it happen?
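A quick way to see which checkpoints the saver itself still tracks (directory name taken from the code above):

import tensorflow as tf

# Inspect the 'checkpoint' state file that tf.train.Saver maintains.
state = tf.train.get_checkpoint_state("MODEL_save")
print(state.model_checkpoint_path)       # the most recent checkpoint
print(state.all_model_checkpoint_paths)  # every checkpoint still tracked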


1 Answer

5 votes

tf.train.Saver has a max_to_keep argument in its constructor that controls how many recent checkpoints are kept; as new checkpoints are saved, older ones are deleted automatically. Somewhat surprisingly, max_to_keep defaults to 5, so by default you only ever have the 5 latest checkpoints on disk. (In your case the extra save without global_step occupies one of those five slots, which is presumably why only the four most recent epoch-numbered checkpoints, 26001 through 29001, survive.)
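A minimal sketch of this pruning behavior (TF 1.x style; the variable name and paths are throwaway examples):

import os
import tensorflow as tf

v = tf.Variable(0, name="v")  # dummy variable so there is something to save
saver = tf.train.Saver()      # default max_to_keep=5

if not os.path.exists("/tmp/saver_demo"):
    os.makedirs("/tmp/saver_demo")  # saver.save does not create the directory

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(1, 9):  # save 8 checkpoints
        saver.save(sess, "/tmp/saver_demo/model", global_step=step)

# Only model-4 ... model-8 remain; model-1 ... model-3 were deleted.
print(tf.train.get_checkpoint_state("/tmp/saver_demo").all_model_checkpoint_paths)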

To keep all checkpoints, set this argument to None:

saver = tf.train.Saver(max_to_keep=None)  # keep every checkpoint
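If keeping every checkpoint takes too much disk space, the constructor also accepts keep_checkpoint_every_n_hours, which permanently keeps one checkpoint per time window on top of the max_to_keep most recent ones:

# Keep the 5 most recent checkpoints, plus one permanent
# checkpoint for every 2 hours of training.
saver = tf.train.Saver(max_to_keep=5, keep_checkpoint_every_n_hours=2)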