mini-batch gradient descent implementation in tensorflow

Question

When reading an tensorflow implementation for a deep learning model, I am trying to understand the following code segment included in the training process.

self.net.gradients_node = tf.gradients(loss, self.variables)
for epoch in range(epochs):
            total_loss = 0
            for step in range((epoch*training_iters), ((epoch+1)*training_iters)):
                batch_x, batch_y = data_provider(self.batch_size)

                # Run optimization op (backprop)
                _, loss, lr, gradients = sess.run((self.optimizer, self.net.cost, self.learning_rate_node, self.net.gradients_node), 
                                                  feed_dict={self.net.x: batch_x,
                                                             self.net.y: util.crop_to_shape(batch_y, pred_shape),
                                                             self.net.keep_prob: dropout})

                if avg_gradients is None:
                    avg_gradients = [np.zeros_like(gradient) for gradient in gradients]
                for i in range(len(gradients)):
                    avg_gradients[i] = (avg_gradients[i] * (1.0 - (1.0 / (step+1)))) + (gradients[i] / (step+1))

                norm_gradients = [np.linalg.norm(gradient) for gradient in avg_gradients]
                self.norm_gradients_node.assign(norm_gradients).eval()



                total_loss += loss

I think it is related to mini-batch gradient descent, but I cannot understand how does it work, or I have some difficulties to connect it to the algorithm shown as follows

Ishamael Ishamael · Accepted Answer · 2016-11-01T21:30:51

This is not related to mini batch SGD.

It computes average gradient over all timesteps. After the first timestep avg_gradients will contain the gradient that was just computed, after the second step it will be elementwise mean of the two gradients from the two steps, after n steps it will be elementwise mean of all the n gradients computed so far. These mean gradients are then normalized (so that their norm is one). It is hard to tell why those average gradients are needed without the context in which they were presented.

mini-batch gradient descent implementation in tensorflow

1 Answers