NaN loss in Keras with triplet loss

Question

I'm trying to learn an embedding for Paris6k images by combining VGG with Adrian Ung's triplet loss. The problem is that after a few iterations in the first epoch the loss becomes NaN, and then the accuracy and validation accuracy climb to 1.

I've already tried lowering the learning rate, increasing the batch size (only to 16 because of memory limits), changing the optimizer (Adam and RMSprop), checking for None values in my dataset, changing the data type from 'float32' to 'float64', adding a small bias to the data, and simplifying the model.

Here is my code:

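import numpy as np
from keras.applications import VGG16
from keras.layers import Input, Flatten, concatenate
from keras.models import Model
from keras.optimizers import Adam

base_model = VGG16(include_top = False, input_shape = (512, 384, 3))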
input_images = base_model.input
input_labels = Input(shape=(1,), name='input_label')

embeddings = Flatten()(base_model.output)
labels_plus_embeddings = concatenate([input_labels, embeddings])

model = Model(inputs=[input_images, input_labels], outputs=labels_plus_embeddings)

batch_size = 16
epochs = 2
embedding_size = 64

opt = Adam(lr=0.0001)

model.compile(loss=tl.triplet_loss_adapted_from_tf, optimizer=opt, metrics=['accuracy'])

label_list = np.vstack(label_list)

x_train = image_list[:2500]
x_val = image_list[2500:]

y_train = label_list[:2500]
y_val = label_list[2500:]

dummy_gt_train = np.zeros((len(x_train), embedding_size + 1))
dummy_gt_val = np.zeros((len(x_val), embedding_size + 1))

H = model.fit(
    x=[x_train,y_train],
    y=dummy_gt_train,
    batch_size=batch_size,
    epochs=epochs,
    validation_data=([x_val, y_val], dummy_gt_val),callbacks=callbacks_list)

There are 3366 images, with values scaled to the range [0, 1]. The network takes dummy target values because it learns embeddings such that images of the same class are close together while images of different classes are far apart, so the real class label is passed in as part of the training input.

I noticed that I was previously splitting the classes incorrectly (and keeping images that should have been discarded), and at that point I did not get the NaN loss.

What should I try to do?

Thanks in advance, and sorry for my English.

Louis T. · Accepted Answer · 2020-05-17T14:40:16

In some cases, a seemingly random NaN loss is caused by the data: if a batch contains no positive pairs, the loss is NaN.

As you can see in Adrian Ung's notebook (or in the TensorFlow Addons triplet loss, which uses the same code):

semi_hard_triplet_loss_distance = math_ops.truediv(
        math_ops.reduce_sum(
            math_ops.maximum(
                math_ops.multiply(loss_mat, mask_positives), 0.0)),
        num_positives,
        name='triplet_semihard_loss')

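The sum is divided by the number of positive pairs (num_positives). When a batch has no positive pairs, both the masked sum and num_positives are zero, so the division is 0/0, which yields NaN.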
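To see the effect in isolation, here is a minimal NumPy sketch of that last division. It is not the TensorFlow code itself: positives_mask and mean_over_positives are hypothetical helper names, and loss_mat is filled with random placeholder values rather than real triplet distances.

```python
import numpy as np

def positives_mask(labels):
    """1.0 where two different samples share a label, 0.0 elsewhere."""
    labels = np.asarray(labels).reshape(-1)
    adjacency = labels[:, None] == labels[None, :]
    # Remove the diagonal so a sample is not counted as its own positive.
    return adjacency.astype(np.float32) - np.eye(len(labels), dtype=np.float32)

def mean_over_positives(loss_mat, mask_positives):
    """NumPy stand-in for the final truediv in the snippet above."""
    num_positives = mask_positives.sum()
    total = np.maximum(loss_mat * mask_positives, 0.0).sum()
    # Silence the 0/0 warning so the NaN result can be inspected directly.
    with np.errstate(invalid="ignore", divide="ignore"):
        return total / num_positives

rng = np.random.default_rng(0)
loss_mat = rng.random((4, 4), dtype=np.float32)  # placeholder values (assumption)

# Batch where every label is different: no positive pairs at all.
loss_all_distinct = mean_over_positives(loss_mat, positives_mask([0, 1, 2, 3]))
# Batch where two samples share label 0: one positive pair (counted both ways).
loss_one_pair = mean_over_positives(loss_mat, positives_mask([0, 0, 1, 2]))

print(loss_all_distinct, loss_one_pair)
```

The first value is NaN because the mask is all zeros, while the second is an ordinary finite number.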

I suggest inspecting your data pipeline to make sure every batch contains at least one positive pair. (For example, you can adapt some of the code in triplet_loss_adapted_from_tf to compute num_positives for each batch and check that it is greater than 0.)
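A rough way to run that check outside the loss function is sketched below with plain NumPy. The names count_positive_pairs and batches_without_positives are hypothetical, and the sketch assumes the labels are given in the exact order in which they are batched. Keep in mind that model.fit shuffles array data by default, so the real batches change every epoch unless you control the batching yourself.

```python
import numpy as np

def count_positive_pairs(batch_labels):
    """Number of ordered pairs (i, j), i != j, that share a label."""
    batch_labels = np.asarray(batch_labels).reshape(-1)
    same = batch_labels[:, None] == batch_labels[None, :]
    # Subtract the diagonal, where every sample trivially matches itself.
    return int(same.sum() - len(batch_labels))

def batches_without_positives(labels, batch_size):
    """Indices of consecutive batches that contain no positive pair."""
    labels = np.asarray(labels).reshape(-1)
    bad = []
    for start in range(0, len(labels), batch_size):
        if count_positive_pairs(labels[start:start + batch_size]) == 0:
            bad.append(start // batch_size)
    return bad

# Toy labels (made up for illustration): the middle batch has four
# different classes, so it would produce a NaN loss.
toy_labels = [0, 0, 1, 2,   3, 4, 5, 6,   7, 7, 7, 8]
print(batches_without_positives(toy_labels, batch_size=4))
```

If the returned list is not empty, those batches hit the 0/0 division. With many classes and a batch size of 16 this is quite likely under random shuffling, so building batches that always include at least two images of some class is one way to avoid it.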
