I ask TensorFlow to save the model every 100 iterations within each epoch; my code is below. But after 900 iterations, only the models for the 500th, 600th, 700th, 800th, and 900th iterations had been saved.

with tf.Session(config = tf.ConfigProto(log_device_placement = True)) as sess:

    sess.run(init_op)

    for i in range(args.num_epochs):
        start_time = time.time()
        k = 0
        acc_train = 0
        # initialize the iterator to train_dataset
        sess.run(train_init_op)
        while True:
            try:
                accu, l, _ = sess.run([accuracy, loss, optimizer], feed_dict = {training: True})
                k += 1
                acc_train += accu
                if k % 100 == 0:
                    print('Epoch: {}, step: {}, training loss: {:.3f}, training accuracy: {:.2f}%'.format(i, k, l, accu * 100))
                    saver.save(sess, args.saved_model_path, global_step = (i+1) * k)
            except tf.errors.OutOfRangeError:
                break

The training output is as follows:

Epoch: 0, step: 100, training loss: 0.669, training accuracy: 59.38%

Epoch: 0, step: 200, training loss: 0.806, training accuracy: 54.69%

Epoch: 0, step: 300, training loss: 0.781, training accuracy: 57.81%

Epoch: 0, step: 400, training loss: 0.725, training accuracy: 64.06%

Epoch: 0, step: 500, training loss: 0.347, training accuracy: 89.06%

Epoch: 0, step: 600, training loss: 0.193, training accuracy: 89.06%

Epoch: 0, step: 700, training loss: 0.003, training accuracy: 100.00%

Epoch: 0, step: 800, training loss: 0.190, training accuracy: 98.44%

Epoch: 0, step: 900, training loss: 0.009, training accuracy: 100.00%

My question is: why did TensorFlow not save the models for the 100th, 200th, 300th, and 400th iterations? Thank you!

1 Answer

It did, but I'm guessing the Saver instance you created had the default max_to_keep value of 5, so it deleted the older checkpoints as the last 5 were created. To keep 10, change your saver creation line to

saver = tf.train.Saver(max_to_keep=10)

You might also want to play with the keep_checkpoint_every_n_hours argument, which additionally keeps one checkpoint for every N hours of training, if you don't want to keep -every- one.
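To see why exactly the 500th through 900th checkpoints survive, here is a minimal pure-Python sketch of a keep-the-newest-N retention policy. It is only an illustration of the behaviour described above, not TensorFlow's actual implementation; the function name surviving_checkpoints and its max_to_keep parameter are made up for this example.

```python
from collections import deque


def surviving_checkpoints(saved_steps, max_to_keep=5):
    """Return the steps whose checkpoints remain after saving them in order.

    Hypothetical model of a keep-the-newest-N policy: once more than
    max_to_keep checkpoints exist, the oldest one is dropped.
    """
    # A bounded deque discards its oldest entry when a new one is appended
    # to a full deque, which mirrors the rotation described in the answer.
    recent = deque(maxlen=max_to_keep)
    for step in saved_steps:
        recent.append(step)
    return list(recent)


# Saves at steps 100, 200, ..., 900, as in the question.
steps = list(range(100, 1000, 100))

print(surviving_checkpoints(steps))                  # [500, 600, 700, 800, 900]
print(surviving_checkpoints(steps, max_to_keep=10))  # all nine steps remain
```

With the limit at 5, the first four saves are rotated out by the time step 900 is written; with a limit of 10, nothing is dropped because only nine checkpoints were ever created.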