Background
In TensorFlow 2, there is a class called GradientTape, which records operations on tensors; the result can then be differentiated and fed to a minimization algorithm. For example, the documentation gives the following:
import tensorflow as tf

x = tf.constant(3.0)
with tf.GradientTape() as g:
    g.watch(x)  # constants are not watched automatically
    y = x * x
dy_dx = g.gradient(y, x)  # Will compute to 6.0
The docstring for the gradient method implies that the first argument can be not just a tensor, but a list of tensors:
def gradient(self,
             target,
             sources,
             output_gradients=None,
             unconnected_gradients=UnconnectedGradients.NONE):
    """Computes the gradient using operations recorded in context of this tape.

    Args:
      target: a list or nested structure of Tensors or Variables to be
        differentiated.
      sources: a list or nested structure of Tensors or Variables. `target`
        will be differentiated against elements in `sources`.
      output_gradients: a list of gradients, one for each element of
        target. Defaults to None.
      unconnected_gradients: a value which can either hold 'none' or 'zero' and
        alters the value which will be returned if the target and sources are
        unconnected. The possible values and effects are detailed in
        'UnconnectedGradients' and it defaults to 'none'.

    Returns:
      a list or nested structure of Tensors (or IndexedSlices, or None),
      one for each element in `sources`. Returned structure is the same as
      the structure of `sources`.

    Raises:
      RuntimeError: if called inside the context of the tape, or if called more
        than once on a non-persistent tape.
      ValueError: if the target is a variable or if unconnected gradients is
        called with an unknown value.
    """
In the above example, it is easy to see that y, the target, is the function to be differentiated, and x is the independent variable the "gradient" is taken with respect to.
From my limited experience, it appears that the gradient method returns a list of tensors, one for each element of sources, and each of these gradients is a tensor with the same shape as the corresponding member of sources.
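To make that observation concrete, here is a minimal sketch, assuming TensorFlow 2 in eager mode. The variables w and b and the scalar target y are made up purely for illustration; they are not from the documentation.

```python
import tensorflow as tf

# Hypothetical sources with different shapes, chosen only for illustration.
w = tf.Variable(tf.ones((2, 3)))
b = tf.Variable(tf.zeros((3,)))

with tf.GradientTape() as g:
    # Trainable variables are watched automatically; the target is a scalar.
    y = tf.reduce_sum(w * w) + tf.reduce_sum(3.0 * b)

# One gradient per element of sources, in the same order.
grads = g.gradient(y, [w, b])
grad_shapes = [gr.shape.as_list() for gr in grads]
source_shapes = [w.shape.as_list(), b.shape.as_list()]
print(grad_shapes, source_shapes)
```

In this sketch each returned gradient has the shape of the source it belongs to, which matches what I described above.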
Question
The above description of the behavior of gradient makes sense if target contains a single 1x1 "tensor" to be differentiated, because mathematically a gradient vector should have the same dimension as the domain of the function.
However, if target is a list of tensors, the output of gradient still has the same shape. Why is this the case? If target is thought of as a list of functions, shouldn't the output resemble something like a Jacobian? How am I to interpret this behavior conceptually?
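For reference, a small experiment that reproduces the behavior in question, sketched under the assumption of TensorFlow 2 in eager mode. The tensors x, y1, and y2 are hypothetical and chosen only for illustration. It compares the result for a list of targets with the per-target gradients and with the tape's jacobian method applied to the stacked targets.

```python
import tensorflow as tf

x = tf.constant([1.0, 2.0])

# persistent=True so the tape can be queried more than once.
with tf.GradientTape(persistent=True) as g:
    g.watch(x)
    y1 = tf.reduce_sum(x * x)    # dy1/dx = 2 * x
    y2 = tf.reduce_sum(3.0 * x)  # dy2/dx = 3
    ys = tf.stack([y1, y2])      # both targets as one vector

# A list of targets: the result still has the shape of x.
grad_list = g.gradient([y1, y2], x)

# Per-target gradients, added by hand for comparison.
grad_sum = g.gradient(y1, x) + g.gradient(y2, x)

# The Jacobian has one row per target and one column per element of x.
jac = g.jacobian(ys, x)

print(grad_list.numpy(), grad_sum.numpy())
print(jac.numpy())
```

In this experiment grad_list matches grad_sum, and also matches the Jacobian summed over its rows, so the list of targets seems to be treated as if the targets had been added together before differentiating.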