2 votes

I have built a small custom image classification train/val dataset with 4 classes. The training set has ~110,000 images and the validation set has ~6,000 images.

The problem I'm experiencing is that, during training, both the training accuracy (measured as an average over the most recent training steps) and the training loss improve, while the validation accuracy and loss stay the same.

This only occurs when I use the inception and resnet models; if I use an alexnet model on the same training and validation data, the validation loss and accuracy do improve.

In my experiments I am using several convolutional architectures imported from tensorflow.contrib.slim.nets.

The code is organized as follows:

...

images, labels = preprocessing(..., train=True)
val_images, val_labels = preprocessing(..., train=False)

...

# AlexNet model
with slim.arg_scope(alexnet.alexnet_v2_arg_scope()):
    logits, _ = alexnet.alexnet_v2(images, ..., is_training=True)
    tf.get_variable_scope().reuse_variables()
    val_logits, _ = alexnet.alexnet_v2(val_images, ..., is_training=False)

# Inception v1 model
with slim.arg_scope(inception_v1_arg_scope()):
    logits, _ = inception_v1(images, ..., is_training=True)
    val_logits, _ = inception_v1(val_images, ..., is_training=False, reuse=True)

loss = my_stuff.loss(logits, labels)
val_loss = my_stuff.loss(val_logits, val_labels)

training_accuracy_op = tf.nn.in_top_k(logits, labels, 1)
top_1_op = tf.nn.in_top_k(val_logits, val_labels, 1)
train_op = ...

...

Instead of using a separate eval script, I run the validation step at the end of each epoch. For debugging purposes I also run an early validation step (before training), and I track the training accuracy by averaging the training predictions over the last x steps.
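For reference, a minimal sketch of that loop, assuming the ops from the snippet above, an already-started tf.Session called sess, and made-up names steps_per_epoch, val_steps and window:

import collections
import numpy as np

def run_validation(sess, val_steps):
    # Average top-1 accuracy and loss over the whole validation set
    correct, losses = [], []
    for _ in range(val_steps):
        top1, vl = sess.run([top_1_op, val_loss])
        correct.append(np.mean(top1))
        losses.append(vl)
    print('precision @ 1 = %.4f val loss = %.2f' % (np.mean(correct), np.mean(losses)))

run_validation(sess, val_steps)                 # early validation step
recent_acc = collections.deque(maxlen=window)   # training accuracy over the last x steps
for step in range(steps_per_epoch):
    _, train_loss, acc = sess.run([train_op, loss, training_accuracy_op])
    recent_acc.append(np.mean(acc))
    if step % 50 == 0:
        print('step %d, loss = %.2f, training_acc = %.4f'
              % (step, train_loss, np.mean(recent_acc)))
run_validation(sess, val_steps)                 # validation step after the epoch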

When I use the Inception v1 model (with the alexnet one commented out), the logger output after 1 epoch is as follows:

early Validation Step
precision @ 1 = 0.2440 val loss = 1.39
Starting epoch 0
step 50, loss = 1.38, training_acc = 0.3250
...
step 1000, loss = 0.58, training_acc = 0.6725
...
step 3550, loss = 0.45, training_acc = 0.8063
Validation Step
precision @ 1 = 0.2473 val loss = 1.39

As shown, training accuracy and loss improve a lot after one epoch, but the validation loss doesn't change at all. I have tested this at least 10 times, and the result is always the same. I would understand if the validation loss were getting worse due to overfitting, but in this case it isn't changing at all.

To rule out any problems with the validation data, I'm also presenting the results of training with the AlexNet implementation in slim. Training with the alexnet model produces the following output:

early Validation Step
precision @ 1 = 0.2448 val loss = 1.39
Starting epoch 0
step 50, loss = 1.39, training_acc = 0.2587
...
step 350, loss = 1.38, training_acc = 0.2919
...
step 850, loss = 1.28, training_acc = 0.3898
Validation Step
precision @ 1 = 0.4069 val loss = 1.25

Both accuracy and loss, on the training and validation data, improve correctly when using the alexnet model, and they keep improving in subsequent epochs.

I don't understand what the cause of the problem may be, or why it shows up with the inception/resnet models but not when training with alexnet.

Does anyone have ideas?

2 Answers

1 vote

After searching through forums, reading various threads, and experimenting, I found the root of the problem.

The problem was the train_op, which had basically been recycled from another example: it worked well with the alexnet model, but it didn't work with the other models because it was missing the batch normalization updates.

To fix this I had to use either

optimizer = tf.train.GradientDescentOptimizer(0.005)
train_op = slim.learning.create_train_op(total_loss, optimizer)

or

train_op = tf.contrib.layers.optimize_loss(total_loss, global_step, 0.005, 'SGD')

This seems to take care of running the batchnorm updates.
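For anyone building the train_op by hand, here is a minimal sketch of the manual equivalent (TF 1.x, assuming import tensorflow as tf and the total_loss and global_step from above):

# slim.batch_norm adds its moving-mean/variance update ops to the
# UPDATE_OPS collection; a hand-rolled train_op has to depend on them,
# otherwise the moving averages never move and the is_training=False
# graph normalizes with stale statistics.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    optimizer = tf.train.GradientDescentOptimizer(0.005)
    train_op = optimizer.minimize(total_loss, global_step=global_step)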

The problem still persisted for short training runs because of the slow moving-average updates.

The default slim arg_scope has the batch-norm decay set to 0.9997, which is stable but apparently needs many steps to converge. Using the same arg_scope but with the decay set to 0.99 or 0.9 did help in this short training scenario.
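As an illustration, a sketch of how the decay can be lowered. short_run_arg_scope is a made-up helper that mirrors what the slim-provided arg_scope does internally; the exact keyword arguments of inception_v1_arg_scope differ between versions, so treat this as an assumption rather than the library API:

def short_run_arg_scope(weight_decay=0.00004, batch_norm_decay=0.99):
    # Same structure as the slim arg_scope, but with a smaller batch-norm
    # decay so the moving averages converge in far fewer steps.
    batch_norm_params = {
        'decay': batch_norm_decay,   # slim's default is 0.9997
        'epsilon': 0.001,
        'updates_collections': tf.GraphKeys.UPDATE_OPS,
    }
    with slim.arg_scope([slim.conv2d, slim.fully_connected],
                        weights_regularizer=slim.l2_regularizer(weight_decay)):
        with slim.arg_scope([slim.conv2d],
                            normalizer_fn=slim.batch_norm,
                            normalizer_params=batch_norm_params) as sc:
            return sc

with slim.arg_scope(short_run_arg_scope(batch_norm_decay=0.99)):
    logits, _ = inception_v1(images, num_classes=4, is_training=True)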

0 votes

It seems you are using logits to calculate the validation loss; using predictions instead may help.

val_logits, _ = inception_v1(val_images, ..., is_training=False, reuse=True)
val_logits = tf.nn.softmax(val_logits)