2
votes

Using TensorFlow's new Dataset API for multi-GPU training (reading from TFRecord files) appears to be considerably slower (about a quarter slower) than running on a single GPU (1 vs. 4 Tesla K80s).

Looking at the output of nvidia-smi, it appears that with 4 GPUs the utilization is only around 15% per GPU, while with a single GPU it is around 45%.

Does loading the data from disk (TFRecord format) cause a bottleneck in training speed? Using regular feed_dicts, with the entire dataset loaded into memory, is also substantially faster than using the Dataset API.
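For reference, a minimal sketch of the kind of pipeline described above (the file pattern, feature names, and batch size below are placeholders, not the actual ones):

    import tensorflow as tf

    # Placeholder parse function; the real records use different feature names.
    def parse_example(serialized):
        features = tf.parse_single_example(
            serialized,
            features={
                'image': tf.FixedLenFeature([], tf.string),
                'label': tf.FixedLenFeature([], tf.int64),
            })
        image = tf.decode_raw(features['image'], tf.uint8)
        return image, features['label']

    dataset = (tf.data.TFRecordDataset(tf.gfile.Glob('train-*.tfrecord'))
               .map(parse_example)
               .shuffle(buffer_size=10000)
               .batch(128)
               .repeat())
    images, labels = dataset.make_one_shot_iterator().get_next()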


1 Answer

0
votes

It seems your network is throttled by:

  1. I/O from disk, as you mention in your last paragraph. If your Dataset starts by reading TFRecords, it will read from disk on every pass; instead, you could load the records into a list/dict in memory first and build the Dataset from a range of indices, e.g.:

    data = tf.constant(your_loaded_list)  # data already pre-loaded into memory
    dataset = (tf.data.Dataset.range(your_data_size)
               .prefetch(20)
               .shuffle(buffer_size=20)
               .map(lambda i: tf.gather(data, i), num_parallel_calls=8))

  2. Heavy pre/post-processing, as suggested by your 2nd paragraph where single-GPU utilization is only around 45%. If that figure was measured with the data already pre-loaded into memory, it suggests your graph spends significant time outside the "main" computation body.

First, check whether multi-threading the map call as above helps; also try trimming down tf.summary operations, which can feed back a lot of unnecessary data and throttle your bandwidth / writes to disk.
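As a rough sketch of the summary point (assuming a plain session loop; train_op, num_steps and the log directory are placeholders for your own code), you could fetch the merged summary op only every N steps instead of on every iteration:

    import tensorflow as tf

    summary_every = 100                      # only write summaries every 100 steps
    merged = tf.summary.merge_all()          # assumes summaries were added elsewhere
    writer = tf.summary.FileWriter('/tmp/logdir')

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for step in range(num_steps):        # num_steps / train_op come from your code
            if step % summary_every == 0:
                _, summ = sess.run([train_op, merged])
                writer.add_summary(summ, step)
            else:
                sess.run(train_op)           # no summary tensors fetched on most steps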

Hope this helps.