Using TensorFlow's new Dataset API to read data in TFRecords format, multi-GPU training on 4 Tesla K80s appears to run considerably slower (about a quarter slower) than the same training on a single K80.
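
For context, the input pipeline is roughly of the following shape (a minimal TF 1.x sketch; the file pattern, feature spec, and batch size are placeholders rather than the actual code):

```python
import tensorflow as tf

# Parsing step for one serialized Example; the feature spec here is a placeholder.
def parse_example(serialized):
    features = tf.parse_single_example(
        serialized,
        features={
            'image': tf.FixedLenFeature([], tf.string),
            'label': tf.FixedLenFeature([], tf.int64),
        })
    image = tf.decode_raw(features['image'], tf.uint8)
    image = tf.cast(image, tf.float32) / 255.0
    return image, features['label']

dataset = (tf.data.TFRecordDataset(tf.gfile.Glob('train-*.tfrecords'))
           .map(parse_example)
           .shuffle(buffer_size=10000)
           .batch(128)
           .repeat())

iterator = dataset.make_one_shot_iterator()
images, labels = iterator.get_next()  # batches are then split across the GPU towers
```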
Looking at the output of `nvidia-smi`, it appears that with 4 GPUs the utilization of each hovers around 15%, while with a single GPU it is around 45%.
Does loading the data from disk (in TFRecords format) create a bottleneck in training speed? Feeding the model with regular feed_dicts, with the entire dataset loaded into memory, is also substantially faster than using the Dataset API.
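
For comparison, the feed_dict variant keeps the whole dataset in host memory and feeds mini-batches directly. A rough, self-contained sketch with synthetic data and a trivial model, purely for illustration (the shapes, model, and step count are not the actual setup):

```python
import numpy as np
import tensorflow as tf

# Hypothetical in-memory dataset; in the real setup this is the full training data.
train_images = np.random.rand(50000, 784).astype(np.float32)
train_labels = np.random.randint(10, size=50000)

x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.int64, [None])
logits = tf.layers.dense(x, 10)
loss = tf.losses.sparse_softmax_cross_entropy(labels=y, logits=logits)
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

batch_size = 128
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(1000):
        # Slice mini-batches straight out of the in-memory arrays.
        start = (step * batch_size) % (len(train_images) - batch_size)
        sess.run(train_op, feed_dict={x: train_images[start:start + batch_size],
                                      y: train_labels[start:start + batch_size]})
```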