Why does training speed not scale with the batch size?

Question

I am surprised that increasing batch size does not increase the total processing speed on a GPU. My measurements:

batch_size=1: 0.33 sec/step
batch_size=2: 0.6 sec/step
batch_size=3: 0.8 sec/step
batch_size=4: 1.0 sec/step

I expected the time per step to remain (almost) constant thanks to parallelization on the GPU. However, it scales almost linearly with the batch size. Why? Did I misunderstand something?

I am using the TensorFlow Object Detection API to retrain the pre-trained faster_rcnn_resnet101_coco model. The predefined batch_size is 1. Our GPU (Nvidia 1080 Ti) can handle up to 4 images, so I wanted to exploit this to accelerate the training.

Why is the question down-voted? Please let me know what's wrong with the question. — lukas

ITiger · Accepted Answer · 2018-01-17T13:35:20

It is often wrongly claimed that batch learning is as fast as or faster than on-line training. In fact, batch learning changes the weights only once, after the complete set of data (the batch) has been presented to the network. Therefore, the weight update frequency is rather low. This explains why the processing speed in your measurements behaves the way you observed.
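To put numbers on this, here is a small calculation that uses only the step times from the question and converts them into images processed per second. The helper name `images_per_second` is made up for this illustration; nothing here comes from the Object Detection API.

```python
# Step times reported in the question (seconds per training step),
# keyed by batch size.
step_times = {1: 0.33, 2: 0.6, 3: 0.8, 4: 1.0}


def images_per_second(batch_size, seconds_per_step):
    """Throughput: how many images are processed per second of training."""
    return batch_size / seconds_per_step


throughput = {b: images_per_second(b, t) for b, t in step_times.items()}

for b, ips in throughput.items():
    print(f"batch_size={b}: {ips:.2f} images/sec")

# Ratio of throughput at batch_size=4 to throughput at batch_size=1.
# If the step time stayed constant, this ratio would be 4.
speedup = throughput[4] / throughput[1]
print(f"throughput gain from batch 1 to batch 4: {speedup:.2f}x")
```

With these measurements the throughput rises from roughly 3 to 4 images per second, a gain of about 1.3x rather than the 4x that a constant step time would give.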

To get a further understanding of these training techniques, have a look at the 2003 paper "The general inefficiency of batch training for gradient descent learning", which compares batch and on-line learning.

Edit:

Regarding your comment:

I don't think any model or data parallelization happens on a single GPU. The GPU parallelizes the vector and matrix operations that are involved in the training algorithm, but the batch learning algorithm is still computed as follows:

loop maxEpochs times
  for each training item
    compute weights and bias deltas for curr item
    accumulate the deltas
  end for
  adjust weights and biases using accumulated deltas
end loop

As you can see, although the weight adjustment is applied only once for the whole batch, the weight and bias deltas still have to be computed for every element in the batch. Therefore, IMHO, batch learning has no large performance advantage over on-line learning.
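The pseudocode above can be written out as a minimal runnable sketch. It assumes a one-feature linear model with a squared-error loss, chosen only to keep the example short; the function name `train_batch`, the learning rate, and the toy data are all hypothetical and have nothing to do with Faster R-CNN.

```python
def train_batch(items, max_epochs=500, lr=0.1):
    """Batch gradient descent for the toy model y ~ w * x + b.

    `items` is a list of (x, y) pairs. The deltas are computed for every
    item, but the weights are adjusted only once per pass over the batch.
    """
    w, b = 0.0, 0.0
    for _ in range(max_epochs):                 # loop maxEpochs times
        w_delta, b_delta = 0.0, 0.0
        for x, y in items:                      # for each training item
            error = (w * x + b) - y             # prediction error for this item
            w_delta += error * x                # accumulate the weight delta
            b_delta += error                    # accumulate the bias delta
        # Single adjustment per batch, using the averaged accumulated deltas.
        w -= lr * w_delta / len(items)
        b -= lr * b_delta / len(items)
    return w, b


# Toy data generated from y = 2x + 0.5 (an assumption for this example).
items = [(x, 2.0 * x + 0.5) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
w, b = train_batch(items)
print(f"w={w:.4f}, b={b:.4f}")
```

The inner loop runs once per item, so the number of delta computations per weight update grows with the batch size, which is the point made above.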
