Gradient descent, when minimizing the average squared error, finds the coefficients of a linear predictor. The algorithm I am referring to is this one. If the learning rate is small enough, the coefficients it finds converge to the global minimum. We know a global minimum exists because the average squared error is a convex function of the weights.
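For concreteness, here is a minimal sketch of that setup (the function name and the toy data are my own, not from the video): batch gradient descent on the mean squared error of a linear predictor.

```python
import numpy as np

def gradient_descent(X, y, lr, n_iters=20000):
    """Minimize the average squared error (1/n) * ||X @ w - y||^2 over w."""
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(n_iters):
        # Gradient of the mean squared error with respect to w
        grad = (2.0 / n) * X.T @ (X @ w - y)
        w -= lr * grad
    return w

# Toy data: the first column of ones acts as the intercept term
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = X @ np.array([1.0, 2.0])  # true weights are [1, 2]
w = gradient_descent(X, y, lr=0.01)
```

With a small enough learning rate (0.01 here), the recovered weights match the true weights to high precision, which is the convergence behavior described above.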
What about the error as a function of the learning rate (aka alpha in the linked video)? Consider two methods for choosing the learning rate:
METHOD 1
Iterate over all i in the range -15 to 2:
- for each i, let the learning rate be 3^i
- run gradient descent for 20000 iterations
- measure the training error
Choose the learning rate 3^i for the i that had the lowest training error.
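Method 1 can be sketched like this (a sketch under my own naming; `gradient_descent` is just a plain batch gradient descent loop on the mean squared error):

```python
import numpy as np

def gradient_descent(X, y, lr, n_iters=20000):
    """Batch gradient descent on the mean squared error."""
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(n_iters):
        w -= lr * (2.0 / n) * X.T @ (X @ w - y)
    return w

def training_error(X, y, w):
    return np.mean((X @ w - y) ** 2)

def method1(X, y):
    """Try every learning rate 3**i for i in -15..2, keep the best one."""
    best_i, best_err = None, float('inf')
    for i in range(-15, 3):  # i = -15, ..., 2
        w = gradient_descent(X, y, lr=3.0 ** i)
        err = training_error(X, y, w)
        if err < best_err:  # NaN/inf from a divergent run never wins here
            best_i, best_err = i, err
    return best_i, 3.0 ** best_i

# Toy data (my own example, not from the video)
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = X @ np.array([1.0, 2.0])
best_i, best_lr = method1(X, y)
```

The large learning rates (e.g. 3^2 = 9) will make gradient descent diverge and overflow to inf/NaN on most data, but the `err < best_err` comparison simply never selects those runs.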
METHOD 2
Iterate over all i in the range -15 to 2:
- for each i, let the learning rate be 3^i
- run gradient descent for 20000 iterations
- measure the training error
- if the error is higher than in the previous iteration, choose the i from the previous iteration and break out of the loop
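A sketch of Method 2, under the same assumed helper names as before (not from the video):

```python
import numpy as np

def gradient_descent(X, y, lr, n_iters=20000):
    """Batch gradient descent on the mean squared error."""
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(n_iters):
        w -= lr * (2.0 / n) * X.T @ (X @ w - y)
    return w

def training_error(X, y, w):
    return np.mean((X @ w - y) ** 2)

def method2(X, y):
    """Stop at the first i whose error exceeds the previous one."""
    prev_i, prev_err = None, float('inf')
    for i in range(-15, 3):  # i = -15, ..., 2
        w = gradient_descent(X, y, lr=3.0 ** i)
        err = training_error(X, y, w)
        # "not <=" also catches a NaN error from a divergent run
        if not err <= prev_err:
            return prev_i, 3.0 ** prev_i  # previous i was better; stop
        prev_i, prev_err = i, err
    return prev_i, 3.0 ** prev_i

# Toy data (my own example, not from the video)
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = X @ np.array([1.0, 2.0])
chosen_i, chosen_lr = method2(X, y)
```

This early stop only saves the remaining runs if the assumption below holds, i.e. if the training error, as a function of i, decreases and then only increases.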
Is Method 2 correct in assuming that once the error increases for some choice of learning rate, all learning rates bigger than that one will be even worse?
In Method 1 we went over all values of the learning rate in a range. In Method 2, we said we don't need to go over all values, just until we see an increase in error.