So when you calculate the gradient, does that mean I kill gradient descent if x <= 0?
Yes! If the weighted sum of the inputs and the bias of the neuron (the input to its activation function) is less than or equal to zero, and the neuron uses the ReLU activation function, then the derivative is zero during backpropagation and the input weights to this neuron are not updated.
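As a minimal sketch in plain Python (the function names here are just for illustration), the factor that vanishes is the ReLU derivative:

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Derivative of ReLU; the convention here takes the derivative at z == 0 as 0.
    return 0.0 if z <= 0 else 1.0

z = -0.7                      # weighted sum of inputs plus bias for the neuron
print(relu(z), relu_grad(z))  # 0.0 0.0 -> the gradient factor is zero, so no weight update
```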
Can someone explain the backpropagation of my neural network architecture 'step by step'?
A simple example can show one step of backpropagation. The example covers the complete process of one step, but you can also look only at the part related to ReLU. It is similar to the architecture introduced in the question and uses one neuron in each layer for simplicity. The architecture is as follows:
f and g represent the ReLU and sigmoid activation functions, respectively, and b represents the bias.
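To write the steps out explicitly, I will use the following labels (the weight and bias names $w_2, b_2, w_3, b_3$ are chosen here just for this sketch): the ReLU neuron receives the weighted-sum input $z_2$ and the sigmoid output neuron receives $z_3$:

$$x \;\xrightarrow{\;w_2,\;b_2\;}\; f\,(\text{ReLU}) \;\xrightarrow{\;w_3,\;b_3\;}\; g\,(\text{sigmoid}) \;\longrightarrow\; h$$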
Step 1:
First, the output is calculated (the forward pass). Here, z and a represent the weighted sum of a neuron's inputs and the output value of its activation function, respectively.
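With the labels assumed above, this step is:

$$z_2 = w_2\,x + b_2,\qquad a_2 = f(z_2) = \max(0,\,z_2)$$
$$z_3 = w_3\,a_2 + b_3,\qquad h = a_3 = g(z_3) = \frac{1}{1 + e^{-z_3}}$$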
So h is the estimated value. Suppose the real value is y.
Weights are now updated with backpropagation.
The new weight is obtained by calculating the gradient of the error function with respect to the weight and subtracting this gradient from the previous weight, i.e.:
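Written out, with $\eta$ denoting the learning rate (the step size, which the sentence above folds into "subtracting the gradient") and $E$ the error function:

$$w^{\text{new}} = w^{\text{old}} - \eta\,\frac{\partial E}{\partial w}$$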
In backpropagation, the gradient for the neuron(s) of the last layer is calculated first. The chain rule of derivatives is used for this calculation:
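For the output weight $w_3$, and assuming for concreteness the squared error $E = (y - h)^2$, the chain rule gives:

$$\frac{\partial E}{\partial w_3} = \frac{\partial E}{\partial h}\cdot\frac{\partial h}{\partial z_3}\cdot\frac{\partial z_3}{\partial w_3} = 2\,(h - y)\cdot g'(z_3)\cdot a_2$$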
The three general terms used above are:
The difference between the actual value and the estimated value
The output of the previous neuron (the neuron feeding into this weight)
And the derivative of the activation function; given that the activation function in the last layer is sigmoid, we have:
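That is, using the standard derivative of the sigmoid:

$$g'(z_3) = g(z_3)\,\bigl(1 - g(z_3)\bigr) = h\,(1 - h)$$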
And the above expression does not necessarily become zero (unlike the ReLU case below, the sigmoid derivative itself is never zero for a finite input).
Now we go to the second layer, the one with the ReLU neuron. For this layer we will have:
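For the ReLU neuron's weight $w_2$ (same labels and squared error as above), the chain rule now also passes through the next layer:

$$\frac{\partial E}{\partial w_2} = \frac{\partial E}{\partial h}\cdot\frac{\partial h}{\partial z_3}\cdot\frac{\partial z_3}{\partial a_2}\cdot\frac{\partial a_2}{\partial z_2}\cdot\frac{\partial z_2}{\partial w_2} = 2\,(h - y)\cdot g'(z_3)\cdot w_3\cdot f'(z_2)\cdot x$$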
It consists of four main terms:
The difference between the actual value and the estimated value.
The output of the neuron feeding into this weight (here, simply the input x).
The sum of the loss derivatives of the connected neurons in the next layer (here there is only one such neuron).
And the derivative of the activation function; since the activation function here is ReLU, we have:
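That is, taking the derivative of ReLU at exactly zero to be 0 (the usual convention):

$$f'(z_2) = \begin{cases} 0, & z_2 \le 0 \\ 1, & z_2 > 0 \end{cases}$$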
If z2 <= 0 (z2 is the input of the ReLU function), this derivative is zero, and therefore the whole product is zero.
Otherwise, the derivative is 1 and the gradient is not necessarily zero.
So if the input to the ReLU neuron is less than or equal to zero, the loss derivative is always zero and its weights will not be updated.
*Note again that it is the weighted sum of the neuron's inputs (plus the bias) that must be less than or equal to zero to kill gradient descent, not the raw input x by itself.
The example given is deliberately simple, just to illustrate the backpropagation process.
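As a quick numerical sanity check of the walkthrough, here is a minimal sketch in plain Python; the numbers and the variable names x, y, w2, b2, w3, b3 are made up for illustration and chosen so that the ReLU input z2 is negative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y = 1.0, 0.5       # input and "real" value
w2, b2 = -0.4, -0.1   # ReLU neuron: z2 = w2*x + b2 will be negative
w3, b3 = 0.8, 0.2     # sigmoid output neuron

# Forward pass
z2 = w2 * x + b2
a2 = max(0.0, z2)     # ReLU output
z3 = w3 * a2 + b3
h = sigmoid(z3)       # estimated value

# Backward pass, assuming the squared error E = (y - h)**2
dE_dh = 2 * (h - y)
dh_dz3 = h * (1 - h)                         # sigmoid derivative g'(z3)
f_prime = 0.0 if z2 <= 0 else 1.0            # ReLU derivative f'(z2)

dE_db3 = dE_dh * dh_dz3                      # gradient for the output neuron's bias
dE_dw2 = dE_dh * dh_dz3 * w3 * f_prime * x   # gradient for the ReLU neuron's weight
dE_db2 = dE_dh * dh_dz3 * w3 * f_prime       # gradient for the ReLU neuron's bias

print(dE_db3)          # nonzero: the output neuron still gets a gradient
print(dE_dw2, dE_db2)  # 0.0 0.0: gradients through the "dead" ReLU vanish
```

Flipping the sign of w2 (so that z2 > 0) makes all of these gradients nonzero.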