8 votes

I would like to know what is considered "best practice" for multi-GPU systems when training networks with TensorFlow.

E.g., one of my networks looks like this:

                         input
                           |
                         (...) <-- convolutional layers
                           |
                       _________
    fully-connected    |       |    fully-connected
    output stream 1 -> |       | <- output stream 2

Does TensorFlow distribute work across multiple GPUs efficiently on its own? Or should I specify myself which GPU TensorFlow should use for a specific operation?

I have not benchmarked it yet; I only started some GPU experiments today. At the moment I have not specified which device to use for the convolutional layers, but I did specify it for the fully-connected layers:

# flattened information of the last convolutional layer
h_pooln_flat = tf.reshape(...)

with tf.device("/gpu:0"):
    # stream 1 stuff

with tf.device("/gpu:1"):
    # stream 2 stuff
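
For completeness, here is roughly what the two streams look like with explicit placement. The shapes, the placeholder, and the variable names below are made up just to make the sketch self-contained; only the device scopes reflect my actual setup:

import tensorflow as tf

# made-up shapes, just so the sketch runs on its own
flat_size, n_hidden, n_out = 7 * 7 * 64, 1024, 10

# stand-in for the output of the last convolutional layer
h_pooln = tf.placeholder(tf.float32, [None, 7, 7, 64])
h_pooln_flat = tf.reshape(h_pooln, [-1, flat_size])

def fc_stream(x, name):
    # one fully-connected output stream; its variables are created on
    # whatever device scope is active when it is called
    with tf.variable_scope(name):
        w1 = tf.get_variable("w1", [flat_size, n_hidden])
        b1 = tf.get_variable("b1", [n_hidden])
        h = tf.nn.relu(tf.matmul(x, w1) + b1)
        w2 = tf.get_variable("w2", [n_hidden, n_out])
        b2 = tf.get_variable("b2", [n_out])
        return tf.matmul(h, w2) + b2

with tf.device("/gpu:0"):
    logits_1 = fc_stream(h_pooln_flat, "stream1")  # output stream 1

with tf.device("/gpu:1"):
    logits_2 = fc_stream(h_pooln_flat, "stream2")  # output stream 2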

Is this a good idea? Or should I leave resource allocation open to TensorFlow?

I guess a single "stream" of convolutional layers cannot be computed in parallel across devices, right? So it does not matter which device does the convolution, pooling, etc. part?

Any tips to get the best performance?

Currently I am training on one node of a Slurm cluster with 2 GPUs, but potentially I could train on more nodes, so 4, 6, or even 8 GPUs. However, I guess there would be a lot of overhead with more than 2 GPUs?


EDIT (slow multi-GPU performance): After some tests I am quite astonished: if I let TensorFlow decide what to allocate and remove the device-specific statements, the network trains considerably faster. This was really surprising to me. What could be more effective than having each output stream on its own GPU when there are two GPUs in total? Additionally, it seems (according to the output) that TensorFlow is only using one GPU?!
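
For what it's worth, the way I am checking which device each op ends up on is with TensorFlow's device placement logging, roughly like this (only the two config flags matter):

import tensorflow as tf

# print the op -> device mapping when the session is created, so it is
# visible whether the graph actually spans both GPUs or only gpu:0
config = tf.ConfigProto(
    log_device_placement=True,   # log every op's assigned device
    allow_soft_placement=True)   # move ops that have no GPU kernel instead of failing

sess = tf.Session(config=config)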


EDIT 2 (NaN values): After some more tests I found that my manual setup of gpu:0 for stream 1 and gpu:1 for stream 2 is not only slower than letting TensorFlow decide what to use (and according to the piped script output, TensorFlow only uses one GPU), but sometimes (I do not know why) it also produces NaN values, either directly at or shortly after initialization. Very weird.

Does TensorFlow need some kind of thread locking or manual copy of input data for multiple GPUs?

I can't answer your question, but I can point out that in TensorFlow's documentation, they mention that allocation of processors (GPUs and CPUs) is done greedily after user-defined placement constraints are applied. Here is the white paper: download.tensorflow.org/paper/whitepaper2015.pdf . See sections 3.2 and 4.3. I'll be curious to see any answers as to the best practices as well. – nfmcclure
All the data transfers are done for you, and you don't need to lock input data to prevent NaNs. But you can also get NaNs if your optimization diverges. – Yaroslav Bulatov
Yeah, but I never got the NaN problem with my network on a single GPU. In 5 out of 5 experiments it converged normally on a single GPU, but in 3 out of 5 multi-GPU runs I got NaN values. Additionally: why should multi-GPU be slower? Due to data transfer between the GPUs I did not expect twice the speed, but why would it be slower? – daniel451

1 Answer

6 votes

The logic for default placement of devices lies in simple_placer.cc.

I may be missing something in the logic, but from this line it seems that it will put all GPU ops on gpu:0.

You can see from the implementation that the placement strategy doesn't take data transfer or computation costs into account, so manual placement is often better than automatic placement. For instance, if you have some kind of input pipeline, default placement usually puts some of the data-processing ops on the GPU, which makes things slower overall.
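
For example (a rough sketch; the file name and feature keys are made up), you can pin the reading and preprocessing ops to the CPU explicitly so only the network itself runs on the GPU:

import tensorflow as tf

# keep reading/decoding/preprocessing on the CPU so the GPUs only run
# the actual network ops
with tf.device("/cpu:0"):
    filename_queue = tf.train.string_input_producer(["train.tfrecords"])
    reader = tf.TFRecordReader()
    _, serialized = reader.read(filename_queue)
    features = tf.parse_single_example(
        serialized,
        features={"image_raw": tf.FixedLenFeature([], tf.string),
                  "label": tf.FixedLenFeature([], tf.int64)})
    image = tf.cast(tf.decode_raw(features["image_raw"], tf.uint8), tf.float32) / 255.0
    label = tf.cast(features["label"], tf.int32)

with tf.device("/gpu:0"):
    # convolutional layers etc. consume the image/label batches here
    pass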

As for your implementation being slow... perhaps there's a gpu:0 -> gpu:1 copy happening somewhere?

Getting multi-GPU setups to work well is very much an open area; let us know what you find!