
Stochastic gradient descent (SGD) algorithms with mini-batches usually take the mini-batch size or count as a parameter.

What I'm wondering now is: do all of the mini-batches need to be exactly the same size?

Take, for example, the MNIST training data (60k training images) and a mini-batch size of 70.

If we go through the data in a simple loop, that produces 857 mini-batches of size 70 (as specified) and one mini-batch of size 10.
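For illustration, here is roughly the chunking loop I have in mind (a plain-Python sketch; the variable names are mine):

```python
# Toy sketch of the split described above: 60,000 MNIST training images
# chunked into mini-batches of size 70, with a smaller batch left over.
num_examples = 60_000
batch_size = 70

full_batches, remainder = divmod(num_examples, batch_size)
print(full_batches, remainder)  # 857 full mini-batches, remainder of 10 examples

indices = list(range(num_examples))
mini_batches = [indices[i:i + batch_size]
                for i in range(0, num_examples, batch_size)]
print(len(mini_batches), len(mini_batches[-1]))  # 858 batches, the last of size 10
```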

Now, does it even matter that (using this approach) one mini-batch will be smaller than the others (worst-case scenario here: a mini-batch of size 1)? Will this strongly affect the weights and biases that our network has learned over almost all of its training?


1 Answer


No, mini-batches do not have to be the same size. They are usually kept at a constant size for efficiency reasons (you do not have to reallocate memory or resize tensors). In practice you could even sample the batch size in each iteration.
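To make this concrete, here is a minimal sketch (my own toy example, not code from the question) of plain mini-batch SGD on a linear least-squares model, where the last mini-batch is smaller than the rest. Because the gradient is averaged over the batch, the smaller batch only gives a noisier estimate; it does not change the scale of the update.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60_000, 10))
true_w = rng.normal(size=10)
y = X @ true_w + 0.1 * rng.normal(size=60_000)

w = np.zeros(10)
lr, batch_size = 0.1, 70

perm = rng.permutation(len(X))
for start in range(0, len(X), batch_size):
    idx = perm[start:start + batch_size]        # the final slice holds only 10 indices
    xb, yb = X[idx], y[idx]
    grad = 2 * xb.T @ (xb @ w - yb) / len(idx)  # mean gradient, so batch size only affects noise
    w -= lr * grad

print(np.linalg.norm(w - true_w))               # small after a single epoch
```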

However, the size of the batch does make a difference. It is hard to say which one is best, but using smaller or bigger batch sizes can lead to different solutions (and, always, to different convergence speeds). This is an effect of dealing with more stochastic motion (small batches) versus smooth updates (good gradient estimators). In particular, drawing the batch size from some predefined distribution can be used to get both effects at the same time (but the time spent fitting this distribution might not be worth it).
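A rough sketch of that last idea (again my own illustration, with hypothetical size choices and probabilities): draw the batch size itself from a fixed distribution in every iteration, mixing noisy small-batch updates with smoother large-batch ones.

```python
import numpy as np

rng = np.random.default_rng(0)
num_examples = 60_000
size_choices = [16, 64, 256]   # hypothetical, predefined distribution of batch sizes
size_probs = [0.5, 0.3, 0.2]

for step in range(1_000):
    batch_size = rng.choice(size_choices, p=size_probs)
    idx = rng.choice(num_examples, size=batch_size, replace=False)
    # ...compute the mean gradient on the examples in idx and update the weights...
```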