
I have a Keras model that trains fine on a single GPU, but when I train it on multiple GPUs, all of the validation losses returned during training are NaN.

I'm using fit_generator with a validation generator. When training on one GPU, the training and validation losses are both valid and my model converges. On 2 or more GPUs, the training losses are still valid, but the validation losses are all NaN. Has anyone encountered this before, and does anyone have advice on how to fix it? I've tried the code on multiple computers, each with different numbers and varieties of Keras/TensorFlow-compatible CUDA GPUs, but to no avail. On any of these computers I can train successfully when using only one GPU.

model = multi_gpu_model(Model(inputs=inputs, outputs=outputs),
                        gpus=number_of_gpus,
                        cpu_merge=True,
                        cpu_relocation=False)

hist = model.fit_generator(generator=training_generator,
                           callbacks=callbacks,
                           max_queue_size=max_queue_size,
                           steps_per_epoch=steps_per_epoch,
                           workers=number_of_workers,
                           validation_data=validation_generator,
                           validation_steps=validation_steps,
                           shuffle=False)

I expected the model to return valid validation losses, but every single validation loss is NaN. As a result I can't accurately benchmark my training on a multi-GPU machine, which is incredibly inconvenient because I'm trying to speed up training.

What are your training and validation batch sizes, and are they multiples of the number of GPUs? If a batch is divided unequally among the GPUs you might get a NaN, and this could also happen if any batch has fewer elements than your batch size. - kvish
My training and validation batch sizes are both 8, which shouldn't have been a problem since I was trying to use 4 and 2 GPUs. Each batch is 5 separate arrays, so that could be a problem. I'll try that out and report back. - EfelBaum
Nope, this did not help. - EfelBaum
I was just looking through the Keras source code to see what might be the issue, but I am not sure where the problem is. Given that training and validation run smoothly on 1 GPU, I still think it is trying to take an average of the metric and is encountering a NaN. Does the problem repeat for other values of queue size and number of workers? - kvish

1 Answer


As far as I can (heuristically) tell, when doing distributed training/evaluation, the number of elements in the dataset must be evenly divisible by the batch size and the number of GPUs. That is, nelements % (batch_size * ngpus) == 0. If that is not the case, then empty batches will be passed to the loss function, which, depending on the loss function, may inject NaN losses into the aggregator.

(In the comments, the OP mentioned that their batch size is evenly divisible by the number of GPUs, which is not the same as the number of elements being divisible by the number of GPUs and batch size.)
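
To make the failure mode concrete, here is a rough pure-Python sketch of how a short final batch can end up as empty per-GPU shards. The splitting scheme (floor-sized shards, with the last one taking the remainder) and all function names are my own assumptions for illustration; this is not the actual Keras or TensorFlow code path.

```python
import math

def batch_lengths(nelements, batch_size):
    """Lengths of successive batches, including a short final batch."""
    full, rest = divmod(nelements, batch_size)
    return [batch_size] * full + ([rest] if rest else [])

def shard_sizes(batch_len, ngpus):
    """Assumed split: floor-sized shards, last shard takes the remainder."""
    step = batch_len // ngpus
    return [step] * (ngpus - 1) + [batch_len - step * (ngpus - 1)]

def mean_or_nan(values):
    """Mimic tensor libraries: the mean of zero elements is NaN."""
    return sum(values) / len(values) if values else math.nan

def merged_loss(errors, ngpus):
    """Mean of per-shard mean errors for one batch."""
    shards, start = [], 0
    for size in shard_sizes(len(errors), ngpus):
        shards.append(errors[start:start + size])
        start += size
    return mean_or_nan([mean_or_nan(s) for s in shards])

# 17 samples with batch size 8 leave a final batch of one sample.
print(batch_lengths(17, 8))     # [8, 8, 1]
# On 4 GPUs, that one-sample batch leaves three shards empty.
print(shard_sizes(1, 4))        # [0, 0, 0, 1]
# A single empty shard is enough to turn the merged loss into NaN.
print(merged_loss([0.5], 4))    # nan
```

In this toy model, a full batch of 8 on 4 GPUs gives a valid loss, while the one-sample leftover batch poisons the average, which matches the symptom of an otherwise healthy run reporting NaN.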

I've encountered this problem writing a custom Keras model and using TF2 nightly. My workaround (which has solved my problem) is to modify any loss functions so that they explicitly check the size of the batch. E.g. assuming some error function named fn:

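import tensorflow as tf

def loss(y_true, y_pred):
    # fn is a placeholder for whatever error function you use.
    err = fn(y_true, y_pred)
    # Return 0 for an empty batch instead of averaging zero elements.
    loss = tf.cond(
        tf.size(y_pred) == 0,
        lambda: 0.,
        lambda: tf.math.reduce_mean(err)
    )
    return loss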

Another workaround would be to truncate the dataset.
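
A minimal sketch of the truncation idea, assuming batch_size is the per-GPU batch size and that dropping the trailing samples is acceptable. The helper name truncated_length is hypothetical, not part of Keras or TensorFlow.

```python
def truncated_length(nelements, batch_size, ngpus):
    """Largest length <= nelements divisible by batch_size * ngpus."""
    chunk = batch_size * ngpus
    return nelements - nelements % chunk

# Example: 1003 validation samples, batch size 8, 4 GPUs.
samples = list(range(1003))
usable = samples[:truncated_length(len(samples), 8, 4)]
print(len(usable))  # 992
```

You would then build the validation generator (and validation_steps) from the truncated data, so that no partial batch is ever produced.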