3 votes

I'm currently trying to implement multi-GPU training with a TensorFlow network. One solution for this would be to run one model per GPU, each with its own data batch, and combine their weights after each training iteration. In other words, "data parallelism".
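A minimal sketch of this kind of synchronous data parallelism in TensorFlow 2, assuming tf.distribute.MirroredStrategy (which averages gradients across replicas at every step rather than literally merging weights afterwards); the tiny model and dummy data are placeholders only:

```python
import tensorflow as tf

# MirroredStrategy replicates the model onto all visible GPUs and
# averages the gradients across replicas at every step, so the
# replicas always hold identical weights.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # Any Keras model works here; this small MLP is just a placeholder.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(32,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.SGD(0.01),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# Dummy data standing in for a real input pipeline; the global batch of 64
# is split evenly across the GPUs, so each replica sees its own slice.
x = tf.random.normal((1024, 32))
y = tf.random.uniform((1024,), maxval=10, dtype=tf.int32)
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(64)

# With model.fit under MirroredStrategy, the per-step gradient
# averaging across replicas happens automatically.
model.fit(dataset, epochs=2)
```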

So, for example, if I use 2 GPUs, train with them in parallel, and combine their weights afterwards, shouldn't the resulting weights be different from training on those two data batches in sequence on one GPU? Both GPUs start from the same input weights, whereas the single GPU has already modified its weights by the time it processes the second batch.
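A toy numeric sketch of that intuition, using made-up numbers and plain SGD on a single weight with a squared-error loss; it confirms that averaging the weights after one parallel step does not, in general, match two sequential steps:

```python
# Toy comparison: each "GPU" takes one SGD step from the same starting
# weight and the resulting weights are averaged, versus processing the
# two batches one after the other on a single device.
# Loss per example: 0.5 * (w * x - y)**2, so d(loss)/dw = (w*x - y)*x.
def step(w, x, y, lr=0.1):
    return w - lr * (w * x - y) * x

w0 = 0.0
batch_a = (1.0, 2.0)   # (x, y) seen by "GPU 0" / first batch
batch_b = (2.0, 2.0)   # (x, y) seen by "GPU 1" / second batch

# Parallel: both replicas start from w0, then their weights are averaged.
w_parallel = 0.5 * (step(w0, *batch_a) + step(w0, *batch_b))

# Sequential: the second batch starts from the already-updated weight.
w_sequential = step(step(w0, *batch_a), *batch_b)

print(w_parallel, w_sequential)  # 0.3 vs 0.52 -- same direction, not equal
```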

Is this difference just marginal, and therefore not relevant for the end result after many iterations?

1
This is why we sometimes need to randomly shuffle the input. – yuefengz
@yuefengz, isn't it the case that we should shuffle the data (to ensure the IID property) so that it's passed to the model in random order during training? – Anu

1 Answer

2 votes

The order of the batches fed into training makes some difference, but the difference may be small if you have a large number of batches. Each batch pulls the variables in the model a bit towards the minimum of the loss. A different order may make the path towards the minimum a bit different, but as long as the loss is decreasing, your model is training and its evaluation keeps improving.

Sometimes, to keep the same batches from repeatedly pulling the model in the same direction, and to avoid it becoming too well fitted to only some of the input data, the input for each model replica is randomly shuffled before being fed into the training program.
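For example, with the tf.data API (assuming TensorFlow 2), each replica's input pipeline can be shuffled before batching; the buffer size below is arbitrary and the data is dummy:

```python
import tensorflow as tf

# Dummy data standing in for the real training set.
features = tf.random.normal((1000, 32))
labels = tf.random.uniform((1000,), maxval=10, dtype=tf.int32)

dataset = (
    tf.data.Dataset.from_tensor_slices((features, labels))
    # reshuffle_each_iteration=True (the default) reorders the
    # examples again at every epoch.
    .shuffle(buffer_size=1000, reshuffle_each_iteration=True)
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)
)
```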