1 vote

I am taking a course on Deep Learning in Python and I am stuck on the following lines of an example:

regressor.compile(optimizer = 'adam', loss = 'mean_squared_error')
regressor.fit(X_train, y_train, epochs = 100, batch_size = 32)

From the definitions I know, 1 epoch = going through all training examples once to do one weight update.

batch_size is used by the optimizer to divide the training examples into mini-batches. Each mini-batch is of size batch_size.

I am not familiar with Adam optimization, but I believe it is a variation of GD or mini-batch GD. Gradient descent has one big batch (all the data) but multiple epochs. Mini-batch gradient descent uses multiple mini-batches but only 1 epoch.

Then how come the code has both multiple mini-batches and multiple epochs? Does epoch in this code have a different meaning than the definition above?


3 Answers

1 vote

Assume you have 3200 examples to train your model. Then 1 epoch = going through all 3200 training examples, but doing backpropagation (and a weight update) 100 times if you set batch_size=32.
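
A quick check of that arithmetic (3200 is just the assumed dataset size from this answer):

n_examples, batch_size = 3200, 32            # assumed dataset size; batch size from the question
updates_per_epoch = n_examples // batch_size
print(updates_per_epoch)                     # 100 backpropagation passes / weight updates per epoch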

1 vote

Although the other answer basically already gives you the correct result, I would like to clarify a few points you made in your post and correct them.
The (commonly accepted) definitions of the different terms are as follows.

  • Gradient Descent (GD): Iterative method to find a (local or global) optimum of your function. Default (batch) gradient descent goes through all examples (one epoch), then updates once.
  • Stochastic Gradient Descent (SGD): Unlike regular GD, it goes through one example, then immediately updates. This way, you get a much higher update frequency.
  • Mini-batching: Since the very frequent updates of SGD are computationally costly and the single-example gradients are noisy, which can lead to worse results in certain circumstances, it is helpful to aggregate multiple (but not all) examples into one update. This means you go through n examples (where n is your batch size) and then update. This still results in multiple updates within one epoch, but not necessarily as many as with SGD (a small sketch contrasting the three schemes follows this list).
  • Epoch: One epoch simply refers to one pass through all of your training data. You can generally perform as many epochs as you like.
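
To make the difference between these update schemes concrete, here is a rough NumPy sketch (the data, learning rate, and batch size of 32 are made up for illustration, and each scheme is shown for a single epoch):

import numpy as np

# Toy data, purely for illustration.
X = np.random.randn(3200, 10)
y = X @ np.random.randn(10) + 0.1 * np.random.randn(3200)

def gradient(w, X_part, y_part):
    # Gradient of the mean squared error with respect to w on the given examples.
    return 2 * X_part.T @ (X_part @ w - y_part) / len(y_part)

w = np.zeros(10)
lr = 0.01

# Batch GD: one update per epoch, computed over all examples.
w -= lr * gradient(w, X, y)

# SGD: one update per example -> 3200 updates per epoch.
for i in range(len(y)):
    w -= lr * gradient(w, X[i:i+1], y[i:i+1])

# Mini-batch GD: one update per batch of 32 -> 100 updates per epoch.
for start in range(0, len(y), 32):
    w -= lr * gradient(w, X[start:start+32], y[start:start+32])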

On another note, you are correct about Adam. It is generally seen as a more powerful variant of vanilla gradient descent, since it uses more sophisticated heuristics (per-parameter learning rates based on running estimates of the first and second moments of the gradients) to speed up and stabilize convergence.
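
For reference, a minimal sketch of the Adam update rule as usually stated (default hyperparameters shown; this is illustrative, not the exact Keras implementation):

import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Running averages of the gradient (first moment) and squared gradient (second moment).
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction, since m and v start at zero.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Per-parameter step, scaled by the second-moment estimate.
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v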

1 vote

Your understanding of epoch and batch_size seems correct.

A little more precision below.

An epoch corresponds to one whole training dataset sweep. This sweep can be performed in several ways.

  • Batch mode: Gradient of loss over the whole training dataset is used to update model weights. One optimisation iteration corresponds to one epoch.
  • Stochastic mode: Gradient of loss over one training dataset point is used to update model weights. If there are N examples in the training dataset, N optimisation iterations correspond to one epoch.
  • Mini-batch mode: Gradient of loss over a small sample of points from the training dataset is used to update model weights. The sample is of size batch_size. If there are N_examples examples in the training dataset, N_examples/batch_size optimisation iterations correspond to one epoch.

In your case (epochs=100, batch_size=32), the regressor would sweep the whole dataset 100 times, with mini-batches of size 32 (i.e. mini-batch mode).

If I assume your dataset size is N_examples, the regressor would perform N_examples/32 model weight optimisation iterations per epoch.

So for 100 epochs: 100*N_examples/32 model weight optimisation iterations.
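
Schematically, and assuming a hypothetical dataset of 3200 examples as in the first answer, the loop that fit(X_train, y_train, epochs=100, batch_size=32) runs internally looks roughly like this (ignoring shuffling, the last partial batch, metrics, etc.):

import numpy as np

# Stand-in data so the loop can run; the shapes are made up.
X_train = np.random.randn(3200, 5)
y_train = np.random.randn(3200)

epochs, batch_size = 100, 32
updates = 0
for epoch in range(epochs):                           # 100 sweeps over the training data
    for start in range(0, len(X_train), batch_size):  # N_examples/batch_size updates per sweep
        x_batch = X_train[start:start + batch_size]
        y_batch = y_train[start:start + batch_size]
        updates += 1  # one forward pass, one backpropagation, one weight update would happen here

print(updates)  # 100 * 3200/32 = 10000 optimisation iterations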

All in all, having epochs > 1 and batch_size > 1 are perfectly compatible with each other.