
Gradient descent minimizing the average squared error finds the coefficients of a linear predictor. The algorithm I am referring to is this one. The coefficients it finds converge to the global minimum if the learning rate is small enough. We know a global minimum exists because the average squared error is a convex function of the weights.
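For concreteness, here is a minimal sketch of the algorithm I mean, assuming NumPy, batch updates, and a made-up toy dataset (none of these specifics come from the video):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, n_iters=20000):
    # Batch gradient descent minimizing the average squared error
    # of a linear predictor y_hat = X @ w (bias folded into X).
    n = len(y)
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = (2.0 / n) * X.T @ (X @ w - y)  # gradient of mean squared error
        w -= alpha * grad
    return w

# Toy data: y = 3x + 1, with a column of ones for the intercept.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 3.0 * x
w = gradient_descent(X, y)  # w approaches [1.0, 3.0]
```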

What about the error as a function of the learning rate (called alpha in the linked video)? Consider two methods for choosing the learning rate:

METHOD 1

Iterate over all i in the range -15 to 2:

  • for each i let learning rate be 3^i .
  • run gradient descent for 20000 iterations
  • measure your training error

Choose learning rate 3^i for the i that had the lowest training error.
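Method 1 can be sketched like this. It is a minimal illustration assuming NumPy, batch gradient descent on mean squared error, and a made-up toy dataset (none of these specifics are from the question):

```python
import numpy as np

def gradient_descent(X, y, alpha, n_iters=20000):
    # Batch gradient descent on mean squared error for y_hat = X @ w.
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        w -= alpha * (2.0 / len(y)) * X.T @ (X @ w - y)
        if not np.all(np.isfinite(w)):  # diverged at this learning rate
            break
    return w

def training_error(X, y, w):
    err = float(np.mean((X @ w - y) ** 2))
    return err if np.isfinite(err) else np.inf

# Toy data: y = 3x + 1 with a bias column (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 3.0 * x

# Method 1: try every learning rate 3^i for i in -15..2, keep the best.
best_i, best_err = None, np.inf
for i in range(-15, 3):
    err = training_error(X, y, gradient_descent(X, y, alpha=3.0 ** i))
    if err < best_err:
        best_i, best_err = i, err
```

Note the finiteness check: for the largest rates (e.g. 3^2) the iterates blow up, and treating that as infinite error keeps the comparison well defined.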

METHOD 2

Iterate over all i in the range -15 to 2:

  • for each i let learning rate be 3^i .
  • run gradient descent for 20000 iterations
  • measure your training error
  • if error is higher than previous iteration, choose the i from the previous iteration and break the loop
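Method 2's early-exit loop can be sketched as follows, under the same illustrative assumptions as before (NumPy, batch gradient descent on mean squared error, a made-up toy dataset):

```python
import numpy as np

def gradient_descent(X, y, alpha, n_iters=20000):
    # Batch gradient descent on mean squared error for y_hat = X @ w.
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        w -= alpha * (2.0 / len(y)) * X.T @ (X @ w - y)
        if not np.all(np.isfinite(w)):  # diverged at this learning rate
            break
    return w

def training_error(X, y, w):
    err = float(np.mean((X @ w - y) ** 2))
    return err if np.isfinite(err) else np.inf

# Toy data: y = 3x + 1 with a bias column (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 3.0 * x

# Method 2: stop at the first i whose error exceeds the previous one.
chosen_i = None
prev_err = np.inf
for i in range(-15, 3):
    err = training_error(X, y, gradient_descent(X, y, alpha=3.0 ** i))
    if err > prev_err:
        chosen_i = i - 1  # the previous learning rate was better
        break
    prev_err = err
```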

Is Method 2 correct in assuming that once the error increases for some choice of learning rate, all learning rates bigger than that one will be even worse?

In Method 1 we tried every value of the learning rate in the range. In Method 2, we said we don't need to try them all, only until we see an increase in error.


1 Answer


Quoting you,

...and measure the error after some fixed number of iterations and when you see an increase in error...

Well, according to the video, this is how convergence is detected: if the change in the cost function between iterations is at most some threshold (0.001, or whatever value you pick), the algorithm stops. So there is already a bound you have set that prevents further iterations once the cost stops changing meaningfully.
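That convergence test can be sketched like this; it is a minimal illustration assuming NumPy and a toy dataset, and the threshold value is arbitrary, just like the 0.001 in the video:

```python
import numpy as np

def gradient_descent_tol(X, y, alpha=0.1, tol=1e-6, max_iters=20000):
    # Gradient descent on mean squared error that stops once the cost
    # decreases by at most `tol` between iterations (threshold illustrative).
    w = np.zeros(X.shape[1])
    prev_cost = np.inf
    for _ in range(max_iters):
        residual = X @ w - y
        cost = float(np.mean(residual ** 2))
        if prev_cost - cost <= tol:  # cost has stopped changing meaningfully
            break
        prev_cost = cost
        w -= alpha * (2.0 / len(y)) * X.T @ residual
    return w

# Toy data: y = 3x + 1 with a bias column (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 3.0 * x
w = gradient_descent_tol(X, y)
```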

When the hypothesis is a linear predictor, the cost function is convex and therefore has a single local/global minimum, so gradient descent will naturally bring you down to that minimum.