I've encountered a problem: I cannot split my training batches across more than one GPU. When multi_gpu_model from tensorflow.keras.utils is used, TensorFlow allocates the full memory on all available GPUs (for example 2), but watching nvidia-smi shows that only the first one (gpu[0]) is utilized at 100%.
I'm using TensorFlow 1.12 right now.
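As far as I know, the full memory allocation by itself is TensorFlow's default behavior and separate from the utilization problem; a minimal sketch (TF 1.x session API) to make nvidia-smi reflect actual usage:

import tensorflow as tf

# Let GPU memory grow on demand instead of grabbing everything up front,
# so nvidia-smi shows what the model actually uses.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
tf.keras.backend.set_session(tf.Session(config=config))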
Test on single device
model = getSimpleCNN(... some parameters)
model.compile()
model.fit()
As expected, the data is loaded by the CPU and the model runs on gpu[0] at 97%-100% GPU utilization:

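To confirm the placement, here is a quick sketch that logs the device each op is assigned to (TF 1.x):

import tensorflow as tf

# Print the device every op is placed on, to verify the model really sits on gpu[0].
config = tf.ConfigProto(log_device_placement=True)
tf.keras.backend.set_session(tf.Session(config=config))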
Create a multi_gpu model
As described in the TensorFlow API docs for multi_gpu_model here, the device scope for the model definition does not have to be changed.
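For contrast, without the cpu_relocation/cpu_merge options the docs recommend instantiating the model's weights under an explicit CPU scope; roughly (a sketch reusing the getSimpleCNN helper from above):

import tensorflow as tf

# Classic pattern: build the template model on the CPU so its weights live in
# host memory and each GPU runs a replica.
with tf.device('/cpu:0'):
    model = getSimpleCNN(... some parameters)

With cpu_merge=False that explicit scope is not needed, so my actual code is: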
from tensorflow.keras.utils import multi_gpu_model
model = getSimpleCNN(... some parameters)
parallel_model = multi_gpu_model(model, gpus=2, cpu_merge=False)  # merge weights on GPU (recommended with NVLink)
parallel_model.compile()
parallel_model.fit()
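Worth noting from the docs: multi_gpu_model splits each incoming batch across the GPUs, so the global batch size should be a multiple of the GPU count. A sketch with hypothetical numbers (x_train/y_train stand in for the real data):

# With gpus=2, each replica processes half of every batch,
# e.g. batch_size=128 means 64 examples per GPU.
parallel_model.fit(x_train, y_train, batch_size=128, epochs=10)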
As seen in the timeline, the CPU now not only loads the data but also performs some other calculations, while the second GPU is doing nearly nothing:

The question
The effect worsens as soon as four GPUs are used: utilization of the first one goes up to 100%, while the rest show only short peaks.
Is there a way to fix this? How do I properly train on multiple GPUs?
Is there a difference between tensorflow.keras.utils and keras.utils that causes this unexpected behavior?