
So I am new to ML and trying to make a simple "library" so I can learn more about neural networks.

My question: as I understand it, I have to take the derivative of each layer according to its activation function so I can calculate the deltas and adjust the weights, etc.

For ReLU, sigmoid, and tanh, it's super simple to implement the derivatives in Java (which is the language I am using, BTW).

But to go from the output back to the input I (obviously) have to start from the output layer, which uses softmax as its activation function.

So do I have to take the derivative of the output layer as well, or does that only apply to every other layer?

If I do have to take the derivative, how can I implement it in Java? Thanks.

I have read a lot of pages explaining the derivative of the softmax function, but they were really complicated for me, and as I said I just started learning ML and didn't want to use an off-the-shelf library, so here I am.

This is the class where I store my activation functions:

public class ActivationFunction {

    public static double tanh(double val) {
        return Math.tanh(val);
    }

    public static double sigmoid(double val) {
        return 1 / (1 + Math.exp(-val));
    }

    public static double relu(double val) {
        return Math.max(val, 0);
    }

    public static double leaky_relu(double val) {
        double result = 0;
        if (val > 0) result = val;
        else result = val * 0.01;
        return result;
    }

    public static double[] softmax(double[] array) {
        // Subtract the max for numerical stability; work on a copy so the caller's array is not modified.
        double max = max(array);
        double sum = 0;
        double[] result = new double[array.length];
        for (int i = 0; i < array.length; i++) {
            result[i] = Math.exp(array[i] - max);
            sum += result[i];
        }
        for (int i = 0; i < result.length; i++) {
            result[i] /= sum;
        }
        return result;
    }

    public static double dTanh(double x) {
        // Derivative of tanh: 1 - tanh(x)^2 (x is the pre-activation value).
        double tan = Math.tanh(x);
        return 1 - tan * tan;
    }

    public static double dSigmoid(double x) {
        // Note: this expects x to already be the sigmoid output, i.e. it returns sigmoid(z) * (1 - sigmoid(z)).
        return x * (1 - x);
    }

    public static double dRelu(double x) {
        double result;
        if (x > 0) result = 1;
        else result = 0;
        return result;
    }

    public static double dLeaky_Relu(double x) {
        double result;
        if (x > 0) result = 1;
        else if (x < 0) result = 0.01;
        else result = 0;
        return result;
    }

    private static double max(double[] array) {
        // Double.MIN_VALUE is the smallest positive double, so start from negative infinity instead.
        double result = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < array.length; i++) {
            if (array[i] > result) result = array[i];
        }
        return result;
    }
}

I am expecting an answer to the question: do I need the derivative of softmax or not? If so, how can I implement it?


1 Answer


A short answer to your first question is yes, you need to compute the derivative of softmax.

The longer version involves some computation, since in order to implement backpropagation you train your network by means of a first-order optimization algorithm that requires calculating the partial derivatives of the cost function w.r.t. the weights, i.e.:

$$\frac{\partial C}{\partial w_{ij}}$$

However, since you are using the softmax for your last layer, it is very likely that you are going to optimize a cross-entropy cost function while training your neural network, namely:

$$C = -\sum_{j} t_j \ln a_j$$

where $t_j$ is the target value and $a_j$ is the softmax output for class j.

Softmax itself represents a probability distribution over n classes:

$$a_j = \frac{e^{z_j}}{\sum_{k} e^{z_k}}$$

where each of the z's is simply the sum of the previous layer's activations times the corresponding weights:

$$z_j^{(n)} = \sum_{i} w_{ij}^{(n)} \, a_i^{(n-1)}$$

where n is the layer index, i is the index of a neuron in the previous layer, and j is the index of a neuron in our softmax layer.

So in order to take partial derivatives with respect to any of these weights, one should calculate:

$$\frac{\partial C}{\partial w_{ij}} = \sum_{k} \frac{\partial C}{\partial a_k} \, \frac{\partial a_k}{\partial z_j} \, \frac{\partial z_j}{\partial w_{ij}}$$

where the second partial derivative, $\partial a_k / \partial z_j$, is indeed the softmax derivative and can be computed in the following way:

$$\frac{\partial a_k}{\partial z_j} = a_k (\delta_{kj} - a_j) = \begin{cases} a_j (1 - a_j) & \text{if } k = j \\ -a_j a_k & \text{if } k \neq j \end{cases}$$

But if you try to compute the aforementioned sum term of the derivative of the cost function w.r.t. the weights, you will get:

$$\sum_{k} \frac{\partial C}{\partial a_k} \frac{\partial a_k}{\partial z_j} = -\sum_{k} \frac{t_k}{a_k} \frac{\partial a_k}{\partial z_j} = -\frac{t_j}{a_j}\, a_j (1 - a_j) + \sum_{k \neq j} \frac{t_k}{a_k}\, a_k a_j = -t_j + a_j \sum_{k} t_k = a_j - t_j$$

(using the fact that the targets sum to one, $\sum_k t_k = 1$).

So in this particular case it turns out that the final result of the computation is quite neat: it is simply the difference between the outputs of the network and the target values. That's it; all you need to compute this sum term of partial derivatives is just:

$$\frac{\partial C}{\partial z_j} = \sum_{k} \frac{\partial C}{\partial a_k} \frac{\partial a_k}{\partial z_j} = a_j - t_j$$

So, to answer your second question: you can combine the computation of the partial derivative of the cross-entropy cost function w.r.t. the output activation (i.e. softmax) with the partial derivative of the output activation w.r.t. $z_j$, which results in a short and clear implementation. If you are using a non-vectorized form, it will look like this:

for (int i = 0; i < lenOfClasses; ++i)
{
    // dC/dz for the output layer: softmax output minus target.
    dCdz[i] = a[i] - t[i];
}

And subsequently you can use dCdz to backpropagate to the rest of the layers of the neural network.