
Background

In TensorFlow 2 there is a class called GradientTape, which records operations on tensors; the recorded result can then be differentiated and fed to some minimization algorithm. For example, the documentation gives this example:

x = tf.constant(3.0)
with tf.GradientTape() as g:
  g.watch(x)
  y = x * x
dy_dx = g.gradient(y, x) # Will compute to 6.0

The docstring for the gradient method implies that the first argument can be a list of tensors, not just a single tensor:

 def gradient(self,
               target,
               sources,
               output_gradients=None,
               unconnected_gradients=UnconnectedGradients.NONE):
    """Computes the gradient using operations recorded in context of this tape.

    Args:
      target: a list or nested structure of Tensors or Variables to be
        differentiated.
      sources: a list or nested structure of Tensors or Variables. `target`
        will be differentiated against elements in `sources`.
      output_gradients: a list of gradients, one for each element of
        target. Defaults to None.
      unconnected_gradients: a value which can either hold 'none' or 'zero' and
        alters the value which will be returned if the target and sources are
        unconnected. The possible values and effects are detailed in
        'UnconnectedGradients' and it defaults to 'none'.

    Returns:
      a list or nested structure of Tensors (or IndexedSlices, or None),
      one for each element in `sources`. Returned structure is the same as
      the structure of `sources`.

    Raises:
      RuntimeError: if called inside the context of the tape, or if called more
       than once on a non-persistent tape.
      ValueError: if the target is a variable or if unconnected gradients is
       called with an unknown value.
    """

In the above example, it is easy to see that y, the target, is the function to be differentiated, and x is the variable the "gradient" is taken with respect to.

From my limited experience, it appears that the gradient method returns a list of tensors, one for each element of sources, and each of these gradients is a tensor with the same shape as the corresponding member of sources.
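
For instance, here is a small sketch of what I mean (shapes chosen arbitrarily for illustration):

import tensorflow as tf

x1 = tf.Variable([1.0, 2.0, 3.0])           # shape (3,)
x2 = tf.Variable([[1.0, 2.0], [3.0, 4.0]])  # shape (2, 2)

with tf.GradientTape() as g:
    y = tf.reduce_sum(x1) + tf.reduce_sum(x2 ** 2)

g1, g2 = g.gradient(y, [x1, x2])
print(g1.shape)  # (3,)   -- same shape as x1
print(g2.shape)  # (2, 2) -- same shape as x2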

Question

The above description of the behavior of gradient makes sense if target contains a single scalar "tensor" to be differentiated, because mathematically a gradient vector should have the same dimension as the domain of the function.

However, if target is a list of tensors, the output of gradient still has the same shape as sources. Why is this the case? If target is thought of as a list of functions, shouldn't the output resemble something like a Jacobian? How am I to interpret this behavior conceptually?


1 Answer


This is how tf.GradientTape().gradient() is defined. It has the same functionality as tf.gradients(), except that the latter can't be used in eager mode. From the docs of tf.gradients():

It returns a list of Tensor of length len(xs) where each tensor is the sum(dy/dx) for y in ys

where xs corresponds to sources and ys to target.

Example 1:

So let's say target = [y1, y2] and sources = [x1, x2]. The result will be:

[dy1/dx1 + dy2/dx1, dy1/dx2 + dy2/dx2]
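
A quick numerical check of this summing behavior (a small sketch with arbitrary values; y1, y2, x1, x2 mirror the notation above):

import tensorflow as tf

x1 = tf.Variable(2.0)
x2 = tf.Variable(3.0)

with tf.GradientTape() as g:
    y1 = x1 * x2   # dy1/dx1 = x2 = 3, dy1/dx2 = x1 = 2
    y2 = x1 ** 2   # dy2/dx1 = 2 * x1 = 4, dy2/dx2 = 0

grads = g.gradient([y1, y2], [x1, x2])
print(grads[0].numpy())  # 7.0 = dy1/dx1 + dy2/dx1
print(grads[1].numpy())  # 2.0 = dy1/dx2 + dy2/dx2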

Example 2:

Compute gradients of a per-sample loss (a tensor) vs. a reduced loss (a scalar). Let w and b be two variables.

xentropy = [y1, y2]                 # tensor: one loss per sample
reduced_xentropy = 0.5 * (y1 + y2)  # scalar: mean loss
grads = [dy1/dw + dy2/dw, dy1/db + dy2/db]
reduced_grads = [d(reduced_xentropy)/dw, d(reduced_xentropy)/db]
              = [d(0.5 * (y1 + y2))/dw, d(0.5 * (y1 + y2))/db]
              = 0.5 * grads

A TensorFlow example of the above:

import tensorflow as tf

print(tf.__version__) # 2.1.0

inputs = tf.convert_to_tensor([[0.1, 0], [0.5, 0.51]]) # two two-dimensional samples
w = tf.Variable(initial_value=inputs)
b = tf.Variable(tf.zeros((2,)))
labels = tf.convert_to_tensor([0, 1])

def forward(inputs, labels, var_list):
    w, b = var_list
    logits = tf.matmul(inputs, w) + b
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=logits)
    return xentropy

# Target is `xentropy`, a tensor with one loss value per sample;
# the gradient w.r.t. each variable sums the per-sample gradients.
with tf.GradientTape() as g:
    xentropy = forward(inputs, labels, [w, b])
    reduced_xentropy = tf.reduce_mean(xentropy)
grads = g.gradient(xentropy, [w, b])
print(xentropy.numpy()) # [0.6881597  0.71584916]
print(grads[0].numpy()) # [[ 0.20586157 -0.20586154]
                        #  [ 0.2607238  -0.26072377]]

# Target is `reduced_xentropy`, a scalar (the mean loss);
# its gradients are half of the summed per-sample gradients above.
with tf.GradientTape() as g:
    xentropy = forward(inputs, labels, [w, b])
    reduced_xentropy = tf.reduce_mean(xentropy)
grads_reduced = g.gradient(reduced_xentropy, [w, b])
print(reduced_xentropy.numpy()) # 0.70200443 <-- scalar
print(grads_reduced[0].numpy()) # [[ 0.10293078 -0.10293077]
                                #  [ 0.1303619  -0.13036188]]

If you compute the loss (xentropy) for each element in a batch and pass that tensor as the target, the final gradient of each variable is the sum of the per-sample gradients over the batch (which makes sense).
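
And if you actually want one gradient per element of the target instead of their sum (the Jacobian-like output the question asks about), the tape also has a jacobian() method. A rough sketch reusing forward, inputs, labels, w and b from the example above:

with tf.GradientTape() as g:
    xentropy = forward(inputs, labels, [w, b])
jac_w = g.jacobian(xentropy, w)  # shape (2, 2, 2): one (2, 2) gradient per sample
print(tf.reduce_sum(jac_w, axis=0).numpy())  # summing over samples recovers grads[0]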