
I have trained three UNet models in Keras for image segmentation to assess the effect of multi-GPU training.

  1. The first model was trained with a batch size of 1 on 1 GPU (P100). Each training step took ~254 ms. (Note this is per step, not per epoch.)
  2. The second model was trained with a batch size of 2 on 1 GPU (P100). Each training step took ~399 ms.
  3. The third model was trained with a batch size of 2 on 2 GPUs (P100). Each training step took ~370 ms. Logically it should have taken about the same time as the first case, since each GPU processes 1 sample in parallel, but it took longer.

Can anyone tell me whether multi-GPU training actually reduces training time? For reference, all models were trained with Keras; a rough sketch of the setup is below.
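For context, here is a minimal sketch of the kind of comparison described above. This is an assumption: the question does not show the actual training code, and the sketch uses tf.distribute.MirroredStrategy with a placeholder build_unet model rather than the real UNet.

```python
# Minimal sketch of the single- vs multi-GPU comparison (assumed setup;
# the original code is not shown in the question).
import tensorflow as tf

def build_unet(input_shape=(128, 128, 1)):
    # Placeholder for the actual UNet; any Keras model is handled the same way here.
    inputs = tf.keras.Input(shape=input_shape)
    x = tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
    outputs = tf.keras.layers.Conv2D(1, 1, activation="sigmoid")(x)
    return tf.keras.Model(inputs, outputs)

# Cases 1 and 2: single GPU, batch size 1 or 2
model = build_unet()
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit(train_ds.batch(1), ...)   # ~254 ms/step in the question
# model.fit(train_ds.batch(2), ...)   # ~399 ms/step

# Case 3: two GPUs, global batch size 2 (1 sample per GPU)
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = build_unet()
    model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit(train_ds.batch(2), ...)   # ~370 ms/step
```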

You should look at the total convergence time, given the same model initialization. Otherwise there is a lot of ambiguity about what "a step" and "an epoch" mean for a multi-GPU model. – Daniel Möller
@DanielMöller: Could you please tell me what you mean by total convergence time? – samra irshad
Do you mean the time to reach the lowest validation error? – samra irshad
Yes, the time the model takes to reach what you expect from it. The answer Srihari posted here seems to say something similar. – Daniel Möller

1 Answer


I presume this is because you use a very small batch_size. In this case, the cost of distributing the computations over two GPUs and synchronizing the gradients back (as well as distributing the data from the CPU to two GPUs) outweighs the parallel speed-up you might gain over training sequentially on 1 GPU.

Expect to see a bigger difference with a batch size of 8 or 16 per GPU, for instance.
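As an illustration (a sketch assuming tf.distribute.MirroredStrategy, not the asker's actual code), you can scale the global batch size with the number of replicas so that each GPU gets a full 8- or 16-sample batch and the gradient synchronization cost is amortized:

```python
# Sketch: scale the global batch size with the number of GPUs so that per-GPU
# compute dominates the gradient all-reduce cost (assumed MirroredStrategy setup).
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
per_gpu_batch = 8                                             # try 8 or 16 per GPU
global_batch = per_gpu_batch * strategy.num_replicas_in_sync  # e.g. 16 on 2 GPUs

with strategy.scope():
    # Placeholder model; substitute the actual UNet here.
    inputs = tf.keras.Input(shape=(128, 128, 1))
    outputs = tf.keras.layers.Conv2D(1, 1, activation="sigmoid")(inputs)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy")

# dataset = dataset.batch(global_batch)
# model.fit(dataset, epochs=...)   # compare ms/step against the single-GPU runs
```

With the larger per-GPU batch, each step does enough computation on every device that the fixed communication overhead becomes a smaller fraction of the step time.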