I would like to know what is considered "best practice" for multi-GPU systems when training networks with TensorFlow.
E.g., one of my networks looks like this:
                    input
                      |
                    (...) <-- convolutional layers
                      |
                  _________
fully-connected   |       |   fully-connected
output stream 1 ->|       |<- output stream 2
Does TensorFlow distribute work across multiple GPUs efficiently on its own? Or should I specify myself which GPU TensorFlow should use for each operation?
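(In case it matters: to see what TensorFlow actually does, I check the device placement log. A minimal sketch of that check, using the standard ConfigProto option; the toy ops are just there so something gets placed:)

import tensorflow as tf

# Print one line per op at graph launch, showing which device it landed on.
config = tf.ConfigProto(log_device_placement=True)
sess = tf.Session(config=config)

a = tf.constant([1.0, 2.0])
b = a * 2  # the placement of this multiply will show up in the log
print(sess.run(b))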
I have not benchmarked this yet; I only started my GPU experiments today. At the moment I have not specified which device to use for the convolutional layers, but I did specify it for the fully-connected layers:
# flattened information of the last convolutional layer
h_pooln_flat = tf.reshape(...)

with tf.device("/gpu:0"):
    # stream 1 stuff

with tf.device("/gpu:1"):
    # stream 2 stuff
Is this a good idea? Or should I leave the resource allocation to TensorFlow?
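To make the question concrete, here is a fleshed-out sketch of that placement. The dimensions and the fc_stream helper are made up for illustration; only the tf.device pinning is the actual point:

import tensorflow as tf

n_flat, n_out = 1024, 10  # made-up dimensions, just for illustration

def fc_stream(x, name):
    # one fully-connected output stream (hypothetical helper)
    with tf.variable_scope(name):
        W = tf.get_variable("W", [n_flat, n_out])
        b = tf.get_variable("b", [n_out], initializer=tf.zeros_initializer())
        return tf.matmul(x, W) + b

# stand-in for the flattened output of the last convolutional layer
h_pooln_flat = tf.placeholder(tf.float32, [None, n_flat])

with tf.device("/gpu:0"):
    stream1 = fc_stream(h_pooln_flat, "stream1")

with tf.device("/gpu:1"):
    stream2 = fc_stream(h_pooln_flat, "stream2")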
I assume a single "stream" of convolutional layers cannot be computed in parallel anyway, since each layer depends on the previous one. So it should not matter which device handles the convolution and pooling parts?
Any tips to get the best performance?
Currently I am training on one node of a Slurm cluster with 2 GPUs, but I could potentially train on more nodes, i.e. with 4, 6 or even 8 GPUs. However, I guess there would be a lot of overhead with more than 2 GPUs?
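In case I do scale up: from what I have read, the usual multi-GPU pattern is data parallelism as in TensorFlow's CIFAR-10 multi-GPU example, i.e. one replica ("tower") of the full model per GPU with shared weights, each fed a slice of the batch, with the gradients averaged before the update. A minimal sketch, with a toy linear model standing in for my network and all sizes made up:

import tensorflow as tf

num_gpus = 2
batch_per_gpu, n_in, n_out = 64, 1024, 10  # made-up sizes

def tower_loss(x, y):
    # toy linear model standing in for the real conv network
    W = tf.get_variable("W", [n_in, n_out])
    b = tf.get_variable("b", [n_out], initializer=tf.zeros_initializer())
    return tf.reduce_mean(tf.square(tf.matmul(x, W) + b - y))

opt = tf.train.GradientDescentOptimizer(0.01)
tower_grads = []
with tf.variable_scope(tf.get_variable_scope()):
    for i in range(num_gpus):
        with tf.device("/gpu:%d" % i):
            x = tf.placeholder(tf.float32, [batch_per_gpu, n_in])
            y = tf.placeholder(tf.float32, [batch_per_gpu, n_out])
            tower_grads.append(opt.compute_gradients(tower_loss(x, y)))
            tf.get_variable_scope().reuse_variables()  # share weights across towers

# average the gradients variable-by-variable across the towers
avg_grads = [(tf.add_n([g for g, _ in gv]) / num_gpus, gv[0][1])
             for gv in zip(*tower_grads)]
train_op = opt.apply_gradients(avg_grads)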
EDIT (slow multi-GPU performance): After some tests I am quite astonished: if I remove the device-specific statements and let TensorFlow decide what to allocate, the network trains considerably faster. That was really surprising to me. What could be more effective than having each output stream on its own GPU when there are two GPUs in total? Additionally, the log output suggests that TensorFlow is only using one GPU?!
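In case the explicit pinning itself is the problem: as far as I understand, ConfigProto has a flag for this, allow_soft_placement, which lets TensorFlow move an op to another legal device when the pinned one cannot run it. A sketch of the session setup I would use to rule this out:

import tensorflow as tf

# Fall back to another device when an op pinned to /gpu:N has no GPU
# kernel, instead of failing or behaving oddly at graph launch.
config = tf.ConfigProto(allow_soft_placement=True)
sess = tf.Session(config=config)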
EDIT2 (NaN values): After some more tests I noticed that my manual setup of gpu:0 for stream 1 and gpu:1 for stream 2 is not only slower than letting TensorFlow decide (where, according to the piped script output, TensorFlow uses just one GPU), but the manual setup also sometimes produces NaN values, and I do not know why. They appear right at, or shortly after, initialization. Very weird.
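To narrow down where the NaNs come from, a debugging sketch I plan to try: tf.add_check_numerics_ops() attaches an assertion to every floating-point tensor in the graph, so the first op that produces a NaN or Inf fails loudly with its name instead of silently propagating. Here train_op and feed are stand-ins for my real training op and feed_dict:

import tensorflow as tf

# Attach a NaN/Inf assertion to every float tensor in the current graph.
check_op = tf.add_check_numerics_ops()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # train_op and feed are stand-ins for the real training op and feed_dict
    sess.run([train_op, check_op], feed_dict=feed)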
Does TensorFlow need some kind of thread locking, or manual copying of the input data, when multiple GPUs are involved?