30 votes

I am trying to implement a neural network with ReLU.

input layer -> 1 hidden layer -> relu -> output layer -> softmax layer

Above is the architecture of my neural network. I am confused about backpropagation through this ReLU. For the derivative of ReLU, if x <= 0, the output is 0; if x > 0, the output is 1. So when you calculate the gradient, does that mean I kill gradient descent if x <= 0?

Can someone explain the backpropagation of my neural network architecture 'step by step'?

6 Answers

13 votes

if x <= 0, output is 0. if x > 0, output is 1

The ReLU function is defined as: for x > 0 the output is x, i.e. f(x) = max(0, x).

So for the derivative f'(x) it's actually:

if x < 0, the output is 0; if x > 0, the output is 1.

The derivative f'(0) is not defined, so it's usually set to 0, or you modify the activation function to be f(x) = max(e, x) for a small e.
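
For illustration, here is a minimal NumPy sketch of that function and derivative convention (treating the undefined point at 0 as 0, which is the common choice):

import numpy as np

def relu(x):
    # f(x) = max(0, x), applied element-wise
    return np.maximum(x, 0)

def relu_derivative(x):
    # 1 where x > 0, 0 where x <= 0 (the value at exactly 0 is a convention)
    return np.where(x > 0, 1.0, 0.0)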

Generally: a ReLU is a unit that uses the rectifier activation function. That means it works exactly like any other hidden layer, except that instead of tanh(x), sigmoid(x), or whatever activation you would otherwise use, you use f(x) = max(0, x).

If you have written code for a working multilayer network with sigmoid activation, it's literally a one-line change. Nothing about forward- or back-propagation changes algorithmically. If you haven't got the simpler model working yet, go back and start with that first. Otherwise your question isn't really about ReLUs but about implementing a NN as a whole.
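
As a rough sketch of what that one-line change looks like in a hand-rolled NumPy layer (this snippet is illustrative, not code from the question):

import numpy as np

x = np.random.randn(4, 3)   # a small batch of inputs
W = np.random.randn(3, 5)   # hidden-layer weights
z = x.dot(W)

# Forward pass with sigmoid would be: a = 1.0 / (1.0 + np.exp(-z))
# With ReLU, the only line that changes is:
a = np.maximum(z, 0)

# In the backward pass, the local derivative likewise changes
# from a * (1 - a) (sigmoid) to (z > 0) (ReLU):
upstream = np.random.randn(4, 5)   # stand-in for the gradient flowing back
grad_z = upstream * (z > 0)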

9 votes

If you have a layer made out of a single ReLU, like your architecture suggests, then yes, you kill the gradient at 0. During training, the ReLU will return 0 to your output layer, which will either return 0 or 0.5 if you're using logistic units, and the softmax will squash those. So a value of 0 under your current architecture doesn't make much sense for the forward propagation part either.

See for example this. What you can do is use a "leaky ReLU", which uses a small slope (such as 0.01) on the negative side instead of a hard 0.
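
A minimal NumPy sketch of a leaky ReLU and its derivative, assuming a negative-side slope of 0.01:

import numpy as np

def leaky_relu(x, slope=0.01):
    # x for x > 0, slope * x otherwise
    return np.where(x > 0, x, slope * x)

def leaky_relu_derivative(x, slope=0.01):
    # 1 for x > 0, slope otherwise, so the gradient is never fully killed
    return np.where(x > 0, 1.0, slope)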

I would reconsider this architecture, however; it doesn't make much sense to me to feed a single ReLU into a bunch of other units and then apply a softmax.

9 votes

Here is a good example: using ReLU to implement XOR (reference: http://pytorch.org/tutorials/beginner/pytorch_with_examples.html).

# -*- coding: utf-8 -*-
import numpy as np
import matplotlib.pyplot as plt

# N is batch size (sample size); D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 4, 2, 30, 1

# Create random input and output data
x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 0.002
loss_col = []
for t in range(200):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)  # using ReLU as the activation function
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum() # loss function
    loss_col.append(loss)
    print(t, loss, y_pred)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y) # the last layer's error
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)  # the hidden layer's error
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0  # apply the derivative of ReLU (zero out where h < 0)
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

plt.plot(loss_col)
plt.show()

For more about the derivative of ReLU, see here: http://kawahara.ca/what-is-the-derivative-of-relu/

3 votes

Yes, the original ReLU function has the problem you describe. So a change was later made to the formula, and it was called leaky ReLU. In essence, leaky ReLU tilts the horizontal part of the function slightly, by a very small amount. For more information, watch this:

An explanation of activation methods, and an improved ReLU, on YouTube

2 votes

So when you calculate the gradient, does that mean I kill gradient descent if x <= 0?

Yes! If the weighted sum of the inputs and the bias of the neuron (the activation function's input) is less than zero, and the neuron uses the ReLU activation function, then the value of the derivative is zero during backpropagation and the input weights to this neuron are not updated.
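
A tiny NumPy check of that statement (the numbers are made up purely for illustration):

import numpy as np

x = np.array([1.0, 2.0])      # inputs to the neuron
w = np.array([-0.5, -0.3])    # weights chosen so the weighted sum is negative
b = -0.1
z = w.dot(x) + b              # weighted sum plus bias = -1.2, so ReLU outputs 0

upstream_grad = 0.7           # whatever gradient arrives from the layers above
local_grad = 1.0 if z > 0 else 0.0          # ReLU derivative at z
grad_w = upstream_grad * local_grad * x     # = [0., 0.], so w is not updated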

Can someone explain the backpropagation of my neural network architecture 'step by step'?

A simple example can show one step of backpropagation. The example covers the complete process of one step, but you can also check only the part related to ReLU. The setup is similar to the architecture introduced in the question and uses one neuron in each layer for simplicity. The architecture is as follows:

[image: the example network: one input, one hidden neuron with activation f, one output neuron with activation g]

f and g represent ReLU and sigmoid, respectively, and b represents the bias.

Step 1: First, the output is calculated:

[image: the forward-pass equations]

This merely represents the output calculation. "z" and "a" represent the weighted sum of a neuron's inputs and the output of the neuron's activation function, respectively. So h is the estimated value. Suppose the real value is y.
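
Since the equation images are not reproduced here, a plausible reconstruction of that forward pass, using this answer's numbering (z2 is the ReLU input, as stated further down) and assumed weight/bias names w_hidden, w_out, b_hidden, b_out, is:

z2 = w_hidden * x + b_hidden,   a2 = f(z2) = max(0, z2)               (hidden neuron, ReLU)
z1 = w_out * a2 + b_out,        a1 = g(z1) = 1 / (1 + exp(-z1)) = h   (output neuron, sigmoid)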

Weights are now updated with backpropagation.

The new weight is obtained by calculating the gradient of the error function with respect to the weight, and subtracting this gradient (scaled by the learning rate η) from the previous weight, i.e.:

W_new = W_old - η * ∂Error/∂W

In backpropagation, the gradient of the last neuron(s) of the last layer is calculated first. The chain rule is used for the calculation:

[image: chain-rule expansion of the gradient for the output-layer weight]

The three general terms used above are:

  • The difference between the actual value and the estimated value

  • Neuron output square

  • And the derivative of the activation function; given that the activation function in the last layer is sigmoid, we have this:

[image: the sigmoid derivative term]

And the above expression does not necessarily become zero.
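
As a hedged reconstruction of the missing equation (assuming a squared-error loss E = 0.5 * (y - h)^2 and the names used above), the output-layer gradient would look like:

dE/dw_out = (h - y) * g'(z1) * a2,   with g'(z1) = g(z1) * (1 - g(z1)) = h * (1 - h)

None of these factors is forced to zero by the sigmoid itself, which is why this gradient does not necessarily become zero.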

Now we go to the second layer. In the second layer we will have:

[image: chain-rule expansion of the gradient for the hidden-layer weight]

It consists of four main terms:

  • The difference between the actual value and the estimated value.

  • Neuron output square

  • The sum of the loss derivatives of the connected neurons in the next layer

  • And the derivative of the activation function; since the activation function here is ReLU, we will have:

If z2 <= 0 (z2 is the input to the ReLU function):

[image: the gradient expression, which evaluates to zero]

Otherwise, it's not necessarily zero:

[image: the gradient expression for z2 > 0]

So if the input to the neuron is less than zero, the loss derivative is always zero and the weights will not update.
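
Under the same assumptions, a reconstruction of the hidden-layer gradient would take roughly the form:

dE/dw_hidden = (h - y) * g'(z1) * w_out * f'(z2) * x

with f'(z2) = 1 if z2 > 0 and f'(z2) = 0 if z2 <= 0, so the whole product collapses to zero whenever the ReLU input z2 is non-positive.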

*To repeat: it is the weighted sum of the neuron's inputs that must be less than zero to kill gradient descent.

The example given is deliberately very simple, to illustrate the backpropagation process.
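
For completeness, here is a small NumPy sketch of this one-neuron-per-layer example under the same assumptions (squared-error loss; all names and numbers are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = 0.5, 1.0                      # single input and target
w_hidden, b_hidden = -0.4, -0.1      # chosen so the ReLU input is negative
w_out, b_out = 0.3, 0.2

# Forward pass
z2 = w_hidden * x + b_hidden         # ReLU input (-0.3 here)
a2 = max(0.0, z2)                    # ReLU output
z1 = w_out * a2 + b_out              # sigmoid input
h = sigmoid(z1)                      # network output

# Backward pass for E = 0.5 * (y - h)**2
delta_out = (h - y) * h * (1 - h)                   # output-layer error term
grad_w_out = delta_out * a2                         # 0 here, because a2 == 0
relu_grad = 1.0 if z2 > 0 else 0.0                  # the ReLU derivative
grad_w_hidden = delta_out * w_out * relu_grad * x   # 0 here, because z2 <= 0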

1 vote

Additionally, you can find an implementation in the Caffe framework here: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/relu_layer.cpp

The negative_slope parameter specifies whether to "leak" the negative part by multiplying it with the slope value rather than setting it to 0. Of course, you should set this parameter to zero to get the classical version.