3
votes

according to wikipedia, with the delta rule we adjust the weight by:

dw = alpha * (ti-yi)*g'(hj)xi

when alpha = learning constant, ti - true answer, yi - perceptron's guess,g' = the derivative of the activation function g with respect to the weighted sum of the perceptron's inputs, xi - input.

The part that I don't understand in this formula is the multiplication by the derivative g'. let g = sign(x) (the sign of the weighted sum). so g' is always 0, and dw = 0. However, in code examples I saw in the internet, the writers just omitted the g' and used the formula:

dw = alpha * (ti-yi)*(hj)xi

I will be glad to read a proper explanation!

thank you in advance.

1

1 Answers

4
votes

You're correct that if you use a step function for your activation function g, the gradient is always zero (except at 0), so the delta rule (aka gradient descent) just does nothing (dw = 0). This is why a step-function perceptron doesn't work well with gradient descent. :)

For a linear perceptron, you'd have g'(x) = 1, for dw = alpha * (t_i - y_i) * x_i.

You've seen code that uses dw = alpha * (t_i - y_i) * h_j * x_i. We can reverse-engineer what's going on here, because apparently g'(h_j) = h_j, which means remembering our calculus that we must have g(x) = e^x + constant. So apparently the code sample you found uses an exponential activation function.

This must mean that the neuron outputs are constrained to be on (0, infinity) (or I guess (a, infinity) for any finite a, for g(x) = e^x + a). I haven't run into this before, but I see some references online. Logistic or tanh activations are more common for bounded outputs (either classification or regression with known bounds).