6
votes

I created an Octave script for training a neural network with 1 hidden layer using backpropagation, but it cannot seem to fit an XOR function.

  • x Input 4x2 matrix [0 0; 0 1; 1 0; 1 1]
  • y Output 4x1 matrix [0; 1; 1; 0]
  • theta Hidden / output layer weights
  • z Weighted sums
  • a Activation function applied to weighted sums
  • m Sample count (4 here)

My weights are initialized as follows

epsilon_init = 0.12;
theta1 = rand(hiddenCount, inputCount + 1) * 2 * epsilon_init * epsilon_init;
theta2 = rand(outputCount, hiddenCount + 1) * 2 * epsilon_init * epsilon_init;

Feed forward

a1 = x;
a1_with_bias = [ones(m, 1) a1];
z2 = a1_with_bias * theta1';
a2 = sigmoid(z2);
a2_with_bias = [ones(size(a2, 1), 1) a2];
z3 = a2_with_bias * theta2';
a3 = sigmoid(z3);

Then I compute the logistic cost function

j = -sum((y .* log(a3) + (1 - y) .* log(1 - a3))(:)) / m;

Back propagation

delta2 = (a3 - y);
gradient2 = delta2' * a2_with_bias / m;

delta1 = (delta2 * theta2(:, 2:end)) .* sigmoidGradient(z2);
gradient1 = delta1' * a1_with_bias / m;

The gradients were verified to be correct using gradient checking.

I then use these gradients to find the optimal values for theta using gradient descent, though using Octave's fminunc function yields the same results. The cost function converges to ln(2) (or 0.5 for a squared errors cost function) because the network outputs 0.5 for all four inputs no matter how many hidden units I use.

Does anyone know where my mistake is?

2
Please show weight initialisation (start value for theta). At a guess, that could be your problem. I'll explain if so. – Neil Slater
epsilon_init = 0.12; theta1 = rand(hiddenCount, inputCount + 1) * 2 * epsilon_init * epsilon_init; theta2 = rand(outputCount, hiddenCount + 1) * 2 * epsilon_init * epsilon_init; Don't know how to format it correctly in a comment, sorry about that! – Torax
I was wrong on my hunch, but at least now I can see if I replicate your results. – Neil Slater
I tried again and it actually does not work on OR and AND either, though it converges to different values then. – Torax

2 Answers

7
votes

Start with a larger range when initialising weights, including negative values. It is difficult for your code to "cross-over" between positive and negative weights, and you probably meant to put * 2 * epsilon_init - epsilon_init; when instead you put * 2 * epsilon_init * epsilon_init;. Fixing that may well fix your code.
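A minimal sketch of that corrected initialisation, assuming illustrative layer sizes (the question does not state hiddenCount). Each weight is drawn uniformly from the interval between -epsilon_init and epsilon_init, so roughly half of them start out negative:

```octave
% Illustrative layer sizes (assumed, not taken from the question).
inputCount = 2;
hiddenCount = 2;
outputCount = 1;

epsilon_init = 0.12;
% rand() is uniform on (0, 1), so rand * 2 * eps - eps is uniform on (-eps, eps).
theta1 = rand(hiddenCount, inputCount + 1) * 2 * epsilon_init - epsilon_init;
theta2 = rand(outputCount, hiddenCount + 1) * 2 * epsilon_init - epsilon_init;
```

By contrast, the original expression multiplies by epsilon_init twice, which yields only positive weights between 0 and 2 * epsilon_init^2 (about 0.029).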

As a rule of thumb, I would do something like this:

theta1 = ( 0.5 * sqrt ( 6 / ( inputCount + hiddenCount) ) * 
    randn( hiddenCount, inputCount + 1 ) );
theta2 = ( 0.5 * sqrt ( 6 / ( hiddenCount + outputCount ) ) * 
    randn( outputCount, hiddenCount + 1 ) );

The multiplier is just some advice I picked up on a course; I think it is backed by a research paper that compared a few different approaches.

In addition, you may need a lot of iterations to learn XOR if you run basic gradient descent. I suggest running for at least 10000 before declaring that learning isn't working. The fminunc function should do better than that.
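For reference, a basic gradient descent loop around the question's feed-forward and backpropagation steps could look roughly like the sketch below. The learning rate alpha, the fixed random seed, the choice of 2 hidden units and the sigmoid helpers are my assumptions, not values from the question:

```octave
% Assumed helpers and hyperparameters (not shown in the question).
sigmoid = @(z) 1 ./ (1 + exp(-z));
sigmoidGradient = @(z) sigmoid(z) .* (1 - sigmoid(z));
alpha = 1.0;          % learning rate (assumed)
iterations = 10000;   % minimum suggested above
hiddenCount = 2;      % assumed

x = [0 0; 0 1; 1 0; 1 1];
y = [0; 1; 1; 0];
m = size(x, 1);
inputCount = size(x, 2);
outputCount = size(y, 2);

randn("state", 1);    % fixed seed so that runs are repeatable
theta1 = 0.5 * sqrt(6 / (inputCount + hiddenCount)) * randn(hiddenCount, inputCount + 1);
theta2 = 0.5 * sqrt(6 / (hiddenCount + outputCount)) * randn(outputCount, hiddenCount + 1);

j_history = zeros(iterations, 1);
for iter = 1:iterations
  % Feed forward (same steps as in the question).
  a1_with_bias = [ones(m, 1) x];
  z2 = a1_with_bias * theta1';
  a2_with_bias = [ones(m, 1) sigmoid(z2)];
  a3 = sigmoid(a2_with_bias * theta2');

  % Logistic cost, recorded so convergence can be inspected afterwards.
  j_history(iter) = -sum(y .* log(a3) + (1 - y) .* log(1 - a3)) / m;

  % Back propagation (same steps as in the question).
  delta2 = a3 - y;
  gradient2 = delta2' * a2_with_bias / m;
  delta1 = (delta2 * theta2(:, 2:end)) .* sigmoidGradient(z2);
  gradient1 = delta1' * a1_with_bias / m;

  % Plain gradient descent update.
  theta1 = theta1 - alpha * gradient1;
  theta2 = theta2 - alpha * gradient2;
end

printf("initial cost %.4f, final cost %.4f\n", j_history(1), j_history(end));
disp(a3');   % network outputs from the last forward pass
```

Plotting j_history is a quick way to tell a run that is still slowly improving from one that is stuck near ln(2).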

I ran your code with 2 hidden neurons, basic gradient descent and the above initialisations, and it learned XOR correctly. I also tried adding momentum terms, and the learning was faster and more reliable, so I suggest you take a look at that next.
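A minimal sketch of a classical momentum update, assuming a momentum coefficient mu of 0.9 and a learning rate alpha of 0.5 (both illustrative, not the exact values I used). The velocity variables are hypothetical names for the running update of each weight matrix:

```octave
% Assumed helper and hyperparameters (illustrative only).
sigmoid = @(z) 1 ./ (1 + exp(-z));
alpha = 0.5;        % learning rate (assumed)
mu = 0.9;           % momentum coefficient (assumed)
iterations = 5000;  % assumed
hiddenCount = 2;

x = [0 0; 0 1; 1 0; 1 1];
y = [0; 1; 1; 0];
m = size(x, 1);

randn("state", 1);  % fixed seed so that runs are repeatable
theta1 = 0.5 * sqrt(6 / (2 + hiddenCount)) * randn(hiddenCount, 3);
theta2 = 0.5 * sqrt(6 / (hiddenCount + 1)) * randn(1, hiddenCount + 1);

% One velocity matrix per weight matrix, starting at rest.
velocity1 = zeros(size(theta1));
velocity2 = zeros(size(theta2));

for iter = 1:iterations
  % Feed forward.
  a1_with_bias = [ones(m, 1) x];
  a2 = sigmoid(a1_with_bias * theta1');
  a2_with_bias = [ones(m, 1) a2];
  a3 = sigmoid(a2_with_bias * theta2');

  % Back propagation; a2 .* (1 - a2) is the sigmoid derivative at z2.
  delta2 = a3 - y;
  gradient2 = delta2' * a2_with_bias / m;
  delta1 = (delta2 * theta2(:, 2:end)) .* (a2 .* (1 - a2));
  gradient1 = delta1' * a1_with_bias / m;

  % Momentum: keep a decaying sum of past gradient steps and move along it.
  velocity1 = mu * velocity1 - alpha * gradient1;
  velocity2 = mu * velocity2 - alpha * gradient2;
  theta1 = theta1 + velocity1;
  theta2 = theta2 + velocity2;
end

disp(a3');   % network outputs from the last forward pass
```

Setting mu to 0 reduces this to plain gradient descent, which makes it easy to compare the two on the same starting weights.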

1
votes

You need at least 3 neurons in the hidden layer, and you should correct the initialization as the first answer suggests. If sigmoidGradient(z2) means a2.*(1-a2), the rest of the code seems OK to me.
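As a quick check of that condition, here is a small sketch with assumed definitions of sigmoid and sigmoidGradient (the question does not show them). It confirms that sigmoidGradient(z2) and a2.*(1-a2) agree when a2 = sigmoid(z2):

```octave
% Assumed definitions; the question does not show these helpers.
sigmoid = @(z) 1 ./ (1 + exp(-z));
sigmoidGradient = @(z) sigmoid(z) .* (1 - sigmoid(z));

z2 = [-2 0; 0.5 3];   % arbitrary test values
a2 = sigmoid(z2);

% Largest element-wise gap between the two forms of the derivative.
max_difference = max(abs(sigmoidGradient(z2) - a2 .* (1 - a2))(:));
printf("max difference: %g\n", max_difference);
```

If the difference is not close to zero with your own sigmoidGradient, the backpropagation step is the place to look.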

Best regards,