Stochastic Gradient Descent algorithms with mini-batches usually take the mini-batch size (or count) as a parameter.
What I'm wondering is: do all of the mini-batches need to be exactly the same size?
Take, for example, the MNIST training data (60k training images) and a mini-batch size of 70.
If we split it in a simple loop, that produces 857 mini-batches of size 70 (as specified) and one mini-batch of size 10.
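For concreteness, here is a minimal sketch of the simple loop I have in mind (plain Python/NumPy, not tied to any framework; `make_batches` is just an illustrative name I made up):

```python
import numpy as np

def make_batches(data, batch_size=70):
    # Split the data in order into consecutive chunks of batch_size;
    # the final chunk just keeps whatever is left over.
    return [data[start:start + batch_size]
            for start in range(0, len(data), batch_size)]

data = np.arange(60000)      # stand-in for the 60k MNIST training images
batches = make_batches(data, 70)

print(len(batches))          # 858 mini-batches in total
print(len(batches[0]))       # 857 of them have size 70
print(len(batches[-1]))      # ...and the last one has size 10
```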
Now, does it even matter that (using this approach) one mini-batch will be smaller than the others (worst case here: a mini-batch of size 1)? Will this strongly affect the weights and biases that our network has learned over almost all of its training?