I have a Keras model that trains fine on a single GPU, but when I train it on multiple GPUs, all of the validation losses reported during training are NaN.
I'm using fit_generator with a validation generator. On one GPU, both the training and validation losses are valid and the model converges. On two or more GPUs, the training losses are still valid, but every validation loss is NaN. Has anyone encountered this before, and does anyone have advice on how to fix it? I've tried the code on several computers, each with a different number and variety of Keras/TensorFlow-compatible CUDA GPUs, but to no avail. On every one of them, training works when I use only one GPU.
model = multi_gpu_model(Model(inputs=inputs, outputs=outputs),
                        gpus=number_of_gpus,
                        cpu_merge=True,
                        cpu_relocation=False)
hist = model.fit_generator(generator=training_generator,
                           callbacks=callbacks,
                           max_queue_size=max_queue_size,
                           steps_per_epoch=steps_per_epoch,
                           workers=number_of_workers,
                           validation_data=validation_generator,
                           validation_steps=validation_steps,
                           shuffle=False)
I expected the model to report valid validation losses, but every validation loss is NaN. As a result, I can't accurately benchmark training on a multi-GPU machine, which is very inconvenient because I'm trying to speed up training.
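To rule out the validation data itself, a check along these lines could be run on the validation generator before training. This is only a minimal sketch under stated assumptions: `check_batches` is a hypothetical helper (not part of Keras), and it assumes the generator yields (inputs, targets) tuples in which each element is a single NumPy array. It flags batches that contain NaN/Inf values and batches with fewer samples than GPUs. The second check is included because multi_gpu_model splits each batch across the GPUs; whether an undersized batch is actually the cause here is unconfirmed.

```python
import numpy as np


def check_batches(generator, steps, number_of_gpus):
    """Scan `steps` batches and list the ones that look suspicious.

    Hypothetical diagnostic helper, not a Keras API. Assumes every batch
    is an (inputs, targets) tuple of NumPy arrays.
    """
    problems = []
    # zip() stops after `steps` batches, so an endless generator is fine.
    for step, batch in zip(range(steps), generator):
        x, y = batch[0], batch[1]
        # NaN or Inf in the data would propagate into the loss.
        if not (np.all(np.isfinite(x)) and np.all(np.isfinite(y))):
            problems.append((step, "non-finite values in batch"))
        # Fewer samples than GPUs means at least one GPU gets no samples.
        if len(x) < number_of_gpus:
            problems.append((step, "batch smaller than number of GPUs"))
    return problems
```

It would be called as check_batches(validation_generator, validation_steps, number_of_gpus). An empty list means neither condition was found, so the cause is more likely elsewhere. Note that the call consumes batches from the generator, so the generator may need to be recreated before it is passed to fit_generator.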