I've been trying to use stochastic gradient descent with a sum-of-squared-errors cost function to build a feed-forward neural network, trained with backpropagation, that can represent this training data:
Input Output
{{0,1} , {1,0,0,0,0,0,0,0}}
{{0.1,1}, {0,1,0,0,0,0,0,0}}
{{0.2,1}, {0,0,1,0,0,0,0,0}}
{{0.3,1}, {0,0,0,1,0,0,0,0}}
{{0.4,1}, {0,0,0,0,1,0,0,0}}
{{0.5,1}, {0,0,0,0,0,1,0,0}}
{{0.6,1}, {0,0,0,0,0,0,1,0}}
{{0.7,1}, {0,0,0,0,0,0,0,1}}
The network consists of 1 input unit, 1 bias unit, 8 output units, and a total of 16 weights (8 input weights and 8 bias weights; each output unit has its own pair of weights, one from the input unit and one from the bias unit). However, training is very slow to converge. I'm using a sigmoid activation function for all output units:
output = 1/(1+e^(-weightedSum))
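For example, the activation of output unit out is computed from that unit's own pair of weights. This is just the indexing scheme above restated as code; input and bias here stand for the two entries of a training row:

// weights[out*2]     : weight from the input unit to output unit `out`
// weights[out*2 + 1] : weight from the bias unit to output unit `out`
double weightedSum = input * weights[out*2] + bias * weights[out*2 + 1];
double output = 1 / (1 + Math.exp(-weightedSum));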
The error gradient I derived for each weight is:
errorGradient = (output - trainingData) * output * (1 - output) * inputUnit
where trainingData refers to the target output specified in the training set at the index of the current output unit, and inputUnit refers to the unit (input or bias) connected to the current weight.
Therefore, I update each individual weight at each iteration with the following equation:
weights[i] = weights[i] - (learningRate * errorGradient)
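Put together, a single weight update looks like this. This is only a sketch mirroring the inner update loop in the code below; target, inputUnit, and i are placeholders for the current target output, the value of the connected unit, and the weight index:

double delta = (output - target) * output * (1 - output);  // dE/d(weightedSum) for this output unit
double errorGradient = delta * inputUnit;                  // dE/d(weights[i])
weights[i] = weights[i] - learningRate * errorGradient;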
Code:
package ann;

import java.util.Arrays;
import java.util.Random;

public class MSEANN {

    static double learningRate = 0.1;
    static double totalError = 0;
    static double previousTotalError = Double.POSITIVE_INFINITY;
    static double[] weights;

    public static void main(String[] args) {
        genRanWeights();

        double[][][] trainingData = {
            {{0,   1}, {1,0,0,0,0,0,0,0}},
            {{0.1, 1}, {0,1,0,0,0,0,0,0}},
            {{0.2, 1}, {0,0,1,0,0,0,0,0}},
            {{0.3, 1}, {0,0,0,1,0,0,0,0}},
            {{0.4, 1}, {0,0,0,0,1,0,0,0}},
            {{0.5, 1}, {0,0,0,0,0,1,0,0}},
            {{0.6, 1}, {0,0,0,0,0,0,1,0}},
            {{0.7, 1}, {0,0,0,0,0,0,0,1}},
        };

        while (true) {
            int errorCount = 0;
            totalError = 0;

            // Iterate through the training set
            for (int i = 0; i < trainingData.length; i++) {
                // Iterate through the list of output units
                for (int out = 0; out < trainingData[i][1].length; out++) {
                    double weightedSum = 0;
                    // Weighted sum for this training example and this output unit
                    for (int ii = 0; ii < trainingData[i][0].length; ii++) {
                        weightedSum += trainingData[i][0][ii] * weights[out * 2 + ii];
                    }
                    // Sigmoid output
                    double output = 1 / (1 + Math.exp(-weightedSum));
                    // Squared error for this output unit
                    double error = Math.pow(trainingData[i][1][out] - output, 2) / 2;
                    totalError += error;
                    if (error >= 0.001) {
                        errorCount++;
                    }
                    // Update the two weights (input and bias) feeding this output unit
                    for (int iii = out * 2; iii < (out + 1) * 2; iii++) {
                        double firstGrad = -(trainingData[i][1][out] - output) * output * (1 - output);
                        weights[iii] -= learningRate * firstGrad * trainingData[i][0][iii % 2];
                    }
                }
            }

            // Total error accumulated over this pass
            System.out.println(totalError);

            // If the error is getting worse every iteration, terminate the program.
            if (totalError - previousTotalError >= 0) {
                System.out.println("FAIL TO CONVERGE");
                System.exit(0);
            }
            previousTotalError = totalError;

            // Every output unit within tolerance for every example: training is done.
            if (errorCount == 0) {
                System.out.println("Final weights: " + Arrays.toString(weights));
                System.exit(0);
            }
        }
    }

    // Generate random initial weights in [-1/sqrt(2), 1/sqrt(2)]
    static void genRanWeights() {
        Random r = new Random();
        double low = -1 / Math.sqrt(2);
        double high = 1 / Math.sqrt(2);
        double[] result = new double[16];
        for (int i = 0; i < result.length; i++) {
            result[i] = low + (high - low) * r.nextDouble();
        }
        System.out.println(Arrays.toString(result));
        weights = result;
    }
}
To debug the code above, I printed the total accumulated error as the program runs, and it shows the error decreasing at every iteration, but at a VERY SLOW rate. I've tweaked the learning rate, and it does not change much. Additionally, I've tried simplifying the training set to the following:
Input Output
{{0 ,1}, {1,0,0,0,0,0,0,0}},
{{0.1,1}, {0,1,0,0,0,0,0,0}},
// {{0.2,1}, {0,0,1,0,0,0,0,0}},
and the network trains very quickly (almost instantly) and reproduces the targeted result. However, if I uncomment the third line, training becomes very slow and never converges while the program runs, even though the sum of errors keeps decreasing. Based on this experimentation, the pattern I found is: with 3 training examples, training takes so long that I have never seen the ANN finish; with 2 examples or fewer, the network instantly produces the correct output.
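Commenting rows out is just a quick way of shrinking the set; slicing the array would be equivalent (Arrays.copyOfRange is not used in my code above, it's only shown here to make the experiment explicit):

// Train on only the first N rows of the full training set.
double[][][] reducedSet = Arrays.copyOfRange(trainingData, 0, 2); // first 2 rows: converges instantly
// double[][][] reducedSet = Arrays.copyOfRange(trainingData, 0, 3); // first 3 rows: extremely slow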
So my question is: is this 'anomaly' due to a wrong choice of activation function, a poor choice of learning rate, or simply a wrong implementation? And in the future, what steps would you recommend for debugging this type of problem effectively?