I was able to implement Softmax so that I could use it with the Cross Entropy cost function, but my question is: should I use the output of Softmax (i.e. the probabilities) to do the backpropagation and update the weights?

To me that doesn't look quite right, because Softmax returns probabilities and not the actual values of the neurons.

The other option is to use the output of the derivative of Softmax. Can someone explain this, please?
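
For reference, here is a minimal Python/NumPy sketch of the setup I mean (the names and numbers are purely illustrative):

import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))  # shift for numerical stability
    return e / e.sum()

def cross_entropy(probs, labels):
    # labels is a one-hot vector; the small epsilon avoids log(0)
    return -np.sum(labels * np.log(probs + 1e-12))

logits = np.array([2.0, 1.0, 0.1])   # raw values of the output neurons
labels = np.array([1.0, 0.0, 0.0])
probs = softmax(logits)              # the probabilities I mentioned
print(probs, cross_entropy(probs, labels))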

1 Answer

You should use the original input values (not the Softmax probabilities) when computing the derivatives.

The equation for computing the error of the output layer is as follows (with f being the activation function and f' its derivative):

# outputs[a] represents the output (activation) of the (a)th layer
# z[n] is the weighted input fed into the activation function of layer n
z[n] = outputs[n-1] . weights[n] + biases[n]
outputs[n] = f(z[n]) # final output

output_error = (outputs[n] - labels) * f'(z[n])

Notice that f' is applied to z[n] = outputs[n-1] . weights[n] + biases[n], not to outputs[n], because z[n] is the original input to our function f, while outputs[n] is its output.
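
To make the shapes concrete, here is a minimal Python/NumPy sketch of that error computation. It is only an illustration: the names (prev_output, W, b, labels) and the choice of a sigmoid activation (so f' can be applied element-wise) are my own assumptions, not code from the post.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

prev_output = np.array([0.2, 0.7, 0.1])   # outputs[n-1]
W = np.random.randn(3, 2) * 0.1           # weights[n]
b = np.zeros(2)                           # biases[n]
labels = np.array([0.0, 1.0])

z = prev_output @ W + b                   # pre-activation input of the layer
output = sigmoid(z)                       # outputs[n] = f(z[n])

# error of the output layer: (outputs[n] - labels) * f'(z[n])
output_error = (output - labels) * sigmoid_prime(z)
print(output_error)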


To better understand how the derivative is useful and how it works, let's first look at its purpose (definition taken from Wikipedia):

The derivative of a function of a real variable measures the sensitivity to change of the function (output) value with respect to a change in its argument (input value).

Essentially it measures how fast (and in what direction) the output changes when the input is changed by a small amount (you could say it measures how the output depends on the input).
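
As a quick numeric illustration of that idea (my own sketch, using the logistic function), a finite-difference estimate agrees with the analytic derivative evaluated at the input:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x, h = 0.5, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x)) / h   # how fast the output changes near x
analytic = sigmoid(x) * (1.0 - sigmoid(x))    # derivative evaluated at the input
print(numeric, analytic)                      # both are roughly 0.235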

Combined with a method of measuring the error of our network (a cost function), we can gain information on the best way to tweak the weights (which determine the inputs of the activation functions) so that the output is closer to our desired labels.

We multiply the error by the derivative, which gives us a small update in the direction and proportion that best moves the function towards our goal. The update is applied to the weights (which determine the inputs of the activation functions), so the next time the network fires, the output will be slightly closer to our labels.
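
A minimal sketch of that update step (plain gradient descent; the values and the learning rate are hypothetical, carried over from the error computation above):

import numpy as np

prev_output = np.array([0.2, 0.7, 0.1])    # outputs[n-1], the input of the layer
output_error = np.array([0.05, -0.12])     # (outputs[n] - labels) * f'(z[n])
W = np.zeros((3, 2))                       # weights[n]
b = np.zeros(2)                            # biases[n]
learning_rate = 0.1                        # assumed step size

# gradient of the cost w.r.t. the weights: outer product of the layer's input and its error
grad_W = np.outer(prev_output, output_error)
grad_b = output_error

W -= learning_rate * grad_W   # small step in the direction that reduces the error
b -= learning_rate * grad_b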

Now, regarding whether to apply the derivative to the result of the function or to its inputs: since we want to know how much the output of our function changes with respect to its input, the derivative must be evaluated at the original inputs of the function in order to give us information about them. That's why the derivative is applied to the layer's weighted input (which is computed from the previous layer's outputs), not to the layer's own output.

You could also try the experiment below to see why that is the case:

softmax [-1, 0, 1] # [9.003057317038046e-2,0.24472847105479767,0.6652409557748219]
softmax' [-1, 0, 1] # [0.19661193324148185,0.25,0.19661193324148185]
softmax' (softmax [-1, 0, 1]) # [0.24949408957503114,0.24629379904081422,0.22426006146673663]

As you can see, softmax' applied to the result of softmax does not convey much information about the original values, as the values produced are too close to each other, while softmax' applied to the original inputs of softmax gives information about the proportions of the inputs.
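
If you want to reproduce those numbers, they match treating softmax' as the element-wise logistic derivative s(x) * (1 - s(x)), where s is the sigmoid function (my reading of the snippet above, since the library used isn't named). A small Python sketch under that assumption:

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # shift for numerical stability
    return e / e.sum()

def softmax_prime(x):
    s = 1.0 / (1.0 + np.exp(-x))   # element-wise logistic derivative
    return s * (1.0 - s)

x = np.array([-1.0, 0.0, 1.0])
print(softmax(x))                  # [0.09003057 0.24472847 0.66524096]
print(softmax_prime(x))            # [0.19661193 0.25       0.19661193]
print(softmax_prime(softmax(x)))   # [0.24949409 0.2462938  0.22426006]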


I recommend this article for explanations of the backpropagation equations: http://neuralnetworksanddeeplearning.com/chap2.html