1 vote

I have implemented a very simple linear regression with a gradient descent algorithm in JavaScript, but after consulting multiple sources and trying several things, I cannot get it to converge.

The data is perfectly linear: the inputs are just the numbers 0 to 30, with x * 3 as the correct outputs to learn.
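
For concreteness, the training set can be generated like this (a minimal sketch; the variable names are only for illustration):

// Inputs 0..30 with y = x * 3 as the target outputs.
const inputs = [];
const outputs = [];
for (let x = 0; x <= 30; x++) {
  inputs.push(x);
  outputs.push(x * 3);
}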

This is the logic behind the gradient descent:

train(input, output) {
  const predictedOutput = this.predict(input);
  const delta = output - predictedOutput;

  this.m += this.learningRate * delta * input;
  this.b += this.learningRate * delta;
}

predict(x) {
  return x * this.m + this.b;
}

I took the formulas from several different sources.

I have already tried:

  • normalizing input and output values to the [-1, 1] range
  • normalizing input and output values to the [0, 1] range
  • normalizing input and output values to have mean = 0 and stddev = 1 (see the sketch after this list)
  • reducing the learning rate (1e-7 is as low as I went)
  • having a linear data set with no bias at all (y = x * 3)
  • having a linear data set with non-zero bias (y = x * 3 + 2)
  • initializing the weights with random non-zero values between -1 and 1
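
For reference, the mean = 0 / stddev = 1 normalization I tried looks roughly like this (a sketch for illustration, not my exact code):

// Standardize a list of values to mean 0 and standard deviation 1.
function standardize(values) {
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const variance = values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
  const std = Math.sqrt(variance) || 1; // avoid dividing by zero for constant data
  return values.map(v => (v - mean) / std);
}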

Still, the weights (this.b and this.m) do not approach the target values, and they diverge to infinity.

I'm obviously doing something wrong, but I cannot figure out what it is.


Update: Here's a little bit more context that may help figure out what my problem is exactly:

I'm trying to model a simple approximation to a linear function, with online learning by a linear regression pseudo-neuron. With that, my parameters are:

  • weights: [this.m, this.b]
  • inputs: [x, 1]
  • activation function: identity function z(x) = x

As such, my net will be expressed by y = this.m * x + this.b * 1, simulating the data-driven function that I want to approximate (y = 3 * x).
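
Written out as that dot product of the weights with the inputs, predict is equivalent to the following (just a sketch to make the neuron framing explicit):

predict(x) {
  const weights = [this.m, this.b];
  const inputs = [x, 1]; // the constant 1 input carries the bias
  // identity activation: z(x) = x, so the output is just the weighted sum
  return weights[0] * inputs[0] + weights[1] * inputs[1];
}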

What I want is for my network to "learn" the parameters this.m = 3 and this.b = 0, but it seems I get stuck in a local minimum.

My error function is the mean-squared error:

error(allInputs, allOutputs) {
  let error = 0;
  for (let i = 0; i < allInputs.length; i++) {
    const x = allInputs[i];
    const y = allOutputs[i];
    const predictedOutput = this.predict(x);
    const delta = y - predictedOutput;

    error += delta * delta;
  }

  return error / allInputs.length;
}

My logic for updating my weights will be (according to the sources I've checked so far) wi -= alpha * dError/dwi

For the sake of simplicity, I'll call my weights this.m and this.b, so we can relate it back to my JavaScript code. I'll also call y^ the predicted value.

From here:

error = y - y^
      = y - this.m * x + this.b

dError/dm = -x
dError/db = 1

And so, applying that to the weight correction logic:

this.m += alpha * x
this.b -= alpha * 1

But this doesn't seem correct at all.

This is way too broad, as there is some code without context. But this line does not look familiar: this.m += this.learningRate * delta * input; the input does not belong there. Your bias handling also looks strange. As I'm not familiar with JS, I expect those expressions are vectorized ones? If not, start from scratch. – sascha
Isn't the learning algorithm w += learningRate * gradient * input? It always appears as the result of taking the derivative of y = m * x + b with respect to m. (Either that, or I'm completely misunderstanding it.) – Alpha
@sascha PS: There is not really much more code than this, but let me know: what can I include that would make this easier to investigate? – Alpha
No, the input is wrong. Check those formulas again! All the information from the input is already within your gradient. But I'm afraid there is much more wrong here. Are you doing math on vectors? Does JS do this by default? And yeah, the bias thing: where did you get the idea to do it that way? – sascha
@sascha Thanks! I'll check the input thing. No, these are all scalars. I'm working on a simple, intuitive example to later build on top of. That's why the bias is separated too. – Alpha

1 Answer

3 votes

I finally found what's wrong, and I'm answering my own question in hopes it will help beginners in this area too.

First, as Sascha said, I had some theoretical misunderstandings. It can be correct for your adjustment to include the input value, but as he said, the input should already appear as part of the gradient. This all depends on your choice of error function.
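
To make that concrete, here is how the gradient works out once you commit to the squared error for a single sample (the constant factor of 2 can be folded into the learning rate):

E = (y - y^)^2, with y^ = this.m * x + this.b

dE/dm = -2 * (y - y^) * x
dE/db = -2 * (y - y^)

So with delta = y - y^, the updates this.m += alpha * delta * x and this.b += alpha * delta are exactly the gradient-descent steps for the squared error; the input x shows up, but only because the gradient with respect to this.m contains it.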

Your error function is how you measure how far off you are from the real value, and that measurement needs to be consistent. I was using mean squared error as the measurement tool (as you can see in my error method), but inside the training method I was effectively using the raw difference (y^ - y) as the error. Your gradient depends on the choice of this error function, so choose only one and stick with it.
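
In practice that means the train method should apply the gradient of the same squared error that the error method reports. A minimal sketch of a consistent version (the factor of 2 from the derivative is absorbed into this.learningRate; this happens to match the update in my original train method when delta is read as coming from the squared error):

train(input, output) {
  const predictedOutput = this.predict(input);
  const delta = output - predictedOutput;

  // Gradient of (output - predicted)^2: -2 * delta * input w.r.t. m, and -2 * delta w.r.t. b.
  // Stepping against the gradient, with the factor of 2 folded into the learning rate:
  this.m += this.learningRate * delta * input;
  this.b += this.learningRate * delta;
}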

Second, simplify your assumptions in order to test what's wrong. In this case, I had a very good idea what the function to approximate was (y = x * 3), so I manually set the weights (this.b and this.m) to the right values, and I still saw the error diverge. This meant that weight initialization was not the problem in this case.
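
That sanity check can be as simple as pinning the weights and evaluating the error (a sketch; model stands in for whatever instance holds predict and error, and the dataset construction here is just for illustration):

// With the weights set to the known solution, the MSE should be (near) zero.
// If it is not, the problem is in the data pipeline, not in the learning step.
model.m = 3;
model.b = 0;

const inputs = [...Array(31).keys()];      // 0, 1, ..., 30
const outputs = inputs.map(x => x * 3);
console.log(model.error(inputs, outputs)); // expected: 0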

After searching some more, I found my error somewhere else: the function that was feeding data into the network was mistakenly passing a hardcoded value of 3 as the target output (it was using the wrong index into an array). So the oscillation I saw was the network trying to approximate y = 0 * x + 3 (this.b = 3 and this.m = 0), but because of the small learning rate and the error in the error-function derivative, this.b was never going to get near the right value, which made this.m take wild jumps trying to adjust to it.

Finally, keep track of the error measurement as your network trains, so you have some insight into what's going on. This helps a lot in telling apart simple overfitting, an overly large learning rate, and plain mistakes.
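
A minimal sketch of that kind of tracking (model, inputs, and outputs stand in for your own instance and data; the epoch count and logging interval are arbitrary choices):

for (let epoch = 0; epoch < 1000; epoch++) {
  // Online learning: one gradient step per sample.
  for (let i = 0; i < inputs.length; i++) {
    model.train(inputs[i], outputs[i]);
  }

  // Log the MSE periodically; it should decrease steadily if everything is wired up correctly.
  if (epoch % 100 === 0) {
    console.log(`epoch ${epoch}: mse = ${model.error(inputs, outputs)}`);
  }
}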