
The TensorFlow documentation explains the function

tf.gradients(
    ys,
    xs,
    grad_ys=None,
    name='gradients',
    colocate_gradients_with_ops=False,
    gate_gradients=False,
    aggregation_method=None,
    stop_gradients=None
)

saying:

  • [it] constructs symbolic derivatives of sum of ys w.r.t. x in xs.
  • ys and xs are each a Tensor or a list of tensors.
  • gradients() adds ops to the graph to output the derivatives of ys with respect to xs.
  • ys: A Tensor or list of tensors to be differentiated

I find it difficult to relate this to the mathematical definition of the gradient. For example, according to Wikipedia, the gradient of a scalar function f(x1, x2, x3, ..., xn) is a vector field (i.e. a function grad f : R^n -> R^n) with certain properties involving the dot product of vectors. You can also speak about the gradient of f at a certain point: (grad f)(x1, x2, x3, ..., xn).
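In coordinates, the definition I have in mind is simply the standard one (nothing TensorFlow-specific):

\operatorname{grad} f = \nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right)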

The TensorFlow documentation speaks about tensors instead of vectors: can the definition of gradient be generalized from functions that map vectors to scalars to functions that map tensors to scalars? Is there a dot product between tensors?

Even if the definition of the gradient can be applied to functions f that map tensors to scalars (with the dot product in the definition working on tensors), the documentation speaks about differentiating the tensors themselves: the parameter ys is a "Tensor or list of tensors to be differentiated". According to the documentation, "Tensor is a multi-dimensional array used for computation". A tensor is therefore not a function, so how can it be differentiated?

So, how exactly is this concept of gradient in TensorFlow related to the definition from Wikipedia?

Comments:

  • Of course there is a dot (inner) product between tensors: en.wikipedia.org/wiki/Dot_product#Tensors – desertnaut
  • @desertnaut: Thanks for the link. – Giorgio

1 Answer


One would expect the TensorFlow gradient to be simply the Jacobian, i.e. the derivative of a rank-m tensor Y with respect to a rank-n tensor X would be the rank-(m + n) tensor consisting of each individual partial derivative ∂Y_{j1...jm} / ∂X_{i1...in}.

However, the result isn't actually a rank-(m + n) tensor: it always has the rank (and shape) of the tensor X. Indeed, TensorFlow gives you the gradient of the scalar sum(Y) with respect to X, exactly as the quoted documentation states ("symbolic derivatives of sum of ys w.r.t. x in xs").
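For example, here is a minimal sketch (assuming the TF 1.x graph-mode API, available as tensorflow.compat.v1 in TF 2): Y is rank 1 and its full Jacobian against X would be a 3x3 matrix, yet tf.gradients returns a tensor with the shape of X, equal to the gradient of sum(Y):

# Minimal sketch: tf.gradients(ys, xs) returns the gradient of sum(ys) w.r.t. xs
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

x = tf.constant([1.0, 2.0, 3.0])   # rank-1 tensor X, shape (3,)
y = x * x                          # rank-1 tensor Y, shape (3,); its full Jacobian would be 3x3

grad = tf.gradients(y, x)[0]       # shape (3,), same as x -- not a 3x3 Jacobian

with tf.Session() as sess:
    print(sess.run(grad))                                    # [2. 4. 6.]
    print(sess.run(tf.gradients(tf.reduce_sum(y), x)[0]))    # also [2. 4. 6.]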

Of course, the underlying per-op Jacobians are still applied internally when the chain rule is used to propagate gradients through the graph.
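One way to see this connection through the public API is the grad_ys argument from the signature quoted in the question: it weights the components of ys, so tf.gradients effectively computes a vector-Jacobian product, which is the building block the chain rule uses. A hedged sketch, under the same TF 1.x graph-mode assumptions as above:

# grad_ys weights the components of y, turning tf.gradients into a vector-Jacobian product
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

x = tf.constant([1.0, 2.0, 3.0])
y = x * x                                # Jacobian dy/dx = diag(2 * x)

v = tf.constant([1.0, 0.0, 0.0])         # weight vector selecting y[0]
row0 = tf.gradients(y, x, grad_ys=v)[0]  # v^T J = first row of the Jacobian

with tf.Session() as sess:
    print(sess.run(row0))                # [2. 0. 0.]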