2 votes

I am trying to implement linear regression in Java. My hypothesis is theta0 + theta1 * x[i]. I am trying to figure out the values of theta0 and theta1 so that the cost function is minimized. I am using gradient descent to find them -

In the gradient descent algorithm:

while(repeat until convergence)
{
   calculate theta0 and theta1 simultaneously.
}

What is this "repeat until convergence"? I understand that it refers to reaching the local minimum, but what exact code should I put in the while loop?

I am very new to machine learning and have just begun coding basic algorithms to get a better understanding. Any help will be greatly appreciated.

"How do I do gradient descent" is too broad as a question. But, you don't solve a linear regression problem with an iterative algorithm; it's a convex problem with a direct solution.Sean Owen
I cannot disagree more. The fact that the problem has a direct solution does not imply that you should not use iterative solutions. There are dozens of reasons to do so, starting with efficiency (as direct methods like OLS require O(n^2) memory and time). There have been nice discussions on Cross Validated about why we do not use direct solutions of optimization problems when they exist. – lejlot

5 Answers

15 votes

Gradient descent is an iterative approach for minimizing a given function. We start with an initial guess of the solution, take the gradient of the function at that point, step the solution in the negative direction of the gradient, and repeat the process. The algorithm will eventually converge where the gradient is zero (which corresponds to a local minimum). So your job is to find the values of theta0 and theta1 that minimize the loss function (for example, least squared error). "Convergence" means you have reached the local minimum and further iterations do not affect the values of the parameters, i.e. theta0 and theta1 remain constant.

Let's see an example. Note: assume the function lies in the first quadrant for this explanation.

[Figure: a convex curve f(x) with starting point A, intermediate points B and C descending toward the minimum at D, and points F and G on the opposite slope]

Let's say you have to minimize a function f(x) (the cost function, in your case). For this you need to find the value of x that minimizes f(x). Here is the step-by-step procedure to find that value of x using the gradient descent method (a minimal 1-D Java sketch follows the list):

  1. Choose an initial value of x. Let's say it is point A in the figure.
  2. Calculate the gradient of f(x) with respect to x at A.
  3. This gives the slope of the function at point A. Since the function is increasing at A, it yields a positive value.
  4. Subtract this positive value from the initial guess of x and update x, i.e. x = x - [some positive value]. This brings x closer to D (the minimum) and reduces the value of f(x) (see the figure). Let's say after iteration 1 you reach point B.
  5. At point B, repeat the same process as in step 4 and reach point C, and finally point D.
  6. At point D, since it is a local minimum, the gradient you calculate is 0 (or very close to 0). Now when you try to update x, i.e. x = x - [0], you get the same x (or a value very close to the previous x). This condition is known as "convergence". The steps above were for an increasing slope, but they are equally valid for a decreasing slope. For example, the gradient at point G is some negative value. When you update x, i.e. x = x - [negative value] = x + [some positive value], x increases and moves closer to point F (i.e. closer to the minimum).
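
To make these steps concrete, here is a minimal 1-D Java sketch of the procedure, using f(x) = (x - 3)^2 as an illustrative convex function (the function, starting point, learning rate, and tolerance are assumptions for the example, not from the question):

    // 1-D gradient descent sketch; f(x) = (x - 3)^2 has its minimum at x = 3.
    double x = 10.0;          // initial guess (think of point A in the figure)
    double alpha = 0.1;       // learning rate (assumed)
    double tolerance = 1e-9;  // convergence threshold (assumed)
    while (true) {
        double gradient = 2 * (x - 3);       // derivative of (x - 3)^2
        double newX = x - alpha * gradient;  // step in the negative gradient direction
        if (Math.abs(newX - x) < tolerance) {
            break;  // x barely changes any more: this is "convergence"
        }
        x = newX;
    }
    // x is now very close to 3, the minimizer of f.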

There are various ways to run this gradient descent loop. As @mattnedrich said, the two basic approaches are

  1. Use a fixed number of iterations N; the pseudocode is

    iter = 0
    while (iter < N) {
      compute gradient0 and gradient1 at the current (theta0, theta1)
      theta0 = theta0 - alpha * gradient0
      theta1 = theta1 - alpha * gradient1
      iter++
    }
    
  2. Repeat until two consecutive values of theta0 and theta1 are almost identical. The pseudocode for this is given by @Gerwin in another answer. (A Java sketch of approach 1 follows this list.)
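
As a rough illustration of approach 1 applied to the asker's hypothesis theta0 + theta1 * x[i], here is a minimal Java sketch; the data arrays, learning rate alpha, and iteration count N are placeholder assumptions:

    // Gradient descent for h(x) = theta0 + theta1 * x with squared-error cost.
    double[] x = {1, 2, 3, 4};  // placeholder training inputs
    double[] y = {2, 4, 6, 8};  // placeholder training outputs
    int m = x.length;
    double theta0 = 0, theta1 = 0;
    double alpha = 0.01;        // learning rate (assumed)
    int N = 10000;              // fixed number of iterations (assumed)
    for (int iter = 0; iter < N; iter++) {
        double grad0 = 0, grad1 = 0;
        for (int i = 0; i < m; i++) {
            double error = (theta0 + theta1 * x[i]) - y[i];
            grad0 += error;          // contribution to d(cost)/d(theta0)
            grad1 += error * x[i];   // contribution to d(cost)/d(theta1)
        }
        // simultaneous update: both gradients were computed from the old thetas
        theta0 -= alpha * grad0 / m;
        theta1 -= alpha * grad1 / m;
    }
    // theta0 and theta1 now approximate the least-squares fit (here, roughly 0 and 2).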

Gradient descent is one approach to minimizing the function in linear regression; a direct solution exists too. Batch processing (also called the normal equation) can be used to find the values of theta0 and theta1 in a single step. If X is the input matrix, y is the output vector, and theta is the parameter vector you want to calculate, then for the squared-error approach you can find theta in a single step using this matrix equation:

theta = inverse(transpose(X) * X) * transpose(X) * y

But as this involves matrix computation, it is obviously a more computationally expensive operation than gradient descent when the matrix X is large. (For the two-parameter case in the question, a closed-form Java sketch follows.) I hope this answers your query; if not, let me know.
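
For the two-parameter hypothesis in the question, the matrix equation above reduces to the familiar closed-form formulas for simple linear regression, so no matrix library is needed. A minimal Java sketch, assuming training arrays x and y are defined elsewhere:

    // Closed-form least squares for h(x) = theta0 + theta1 * x;
    // equivalent to the normal equation for this two-parameter model.
    double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
    int m = x.length;
    for (int i = 0; i < m; i++) {
        sumX  += x[i];
        sumY  += y[i];
        sumXY += x[i] * y[i];
        sumXX += x[i] * x[i];
    }
    double theta1 = (m * sumXY - sumX * sumY) / (m * sumXX - sumX * sumX);
    double theta0 = (sumY - theta1 * sumX) / m;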

4 votes

Gradient descent is an optimization algorithm (for minimization, to be exact; there is gradient ascent for maximization too). In the case of linear regression, we minimize the cost function. Gradient descent belongs to the family of gradient-based optimization methods, and its idea is that repeatedly stepping in the direction of the negative gradient takes the parameters down the hill of the cost surface toward the optimum.

In your algorithm, "repeat till convergence" means: repeat until you reach the optimal point on the cost surface/curve, which is detected when the gradient is very, very close to zero for some iterations. In that case the algorithm is said to have converged (possibly at a local optimum; gradient descent converges to a local optimum in many cases).

To determine whether your algorithm has converged, you can do the following (a Java rendering of this check is sketched after the pseudocode):

theta = initial guess
while (True):
    calculate gradient at theta
    newTheta = theta - alpha * gradient
    if gradient is very close to zero and abs(newTheta - theta) is very close to zero:
        break from loop  # the algorithm has converged
    theta = newTheta
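
Here is a hedged Java version of the same check for the two-parameter model in the question. The helper methods costGradient0 and costGradient1 are hypothetical; you would implement them to return the partial derivatives of your cost function over your data:

    // costGradient0/costGradient1 are hypothetical helpers returning the
    // partial derivatives of the cost at the current (theta0, theta1).
    double theta0 = 0, theta1 = 0;
    double alpha = 0.01;    // learning rate (assumed)
    double epsilon = 1e-8;  // "very close to zero" threshold (assumed)
    while (true) {
        double g0 = costGradient0(theta0, theta1);
        double g1 = costGradient1(theta0, theta1);
        double newTheta0 = theta0 - alpha * g0;
        double newTheta1 = theta1 - alpha * g1;
        boolean gradientNearZero = Math.abs(g0) < epsilon && Math.abs(g1) < epsilon;
        boolean thetasStable = Math.abs(newTheta0 - theta0) + Math.abs(newTheta1 - theta1) < epsilon;
        if (gradientNearZero && thetasStable) {
            break;  // the algorithm has converged
        }
        theta0 = newTheta0;
        theta1 = newTheta1;
    }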

For details on linear regression, gradient descent, and other optimizations, you can follow Andrew Ng's notes: http://cs229.stanford.edu/notes/cs229-notes1.pdf

1 vote

I do not know very much about gradient descent, but we learned another way to calculate linear regression from a number of points:

http://en.wikipedia.org/wiki/Simple_linear_regression#Fitting_the_regression_line

But if you really want to add the while loop, I recommend the following:

Eventually, theta0 and theta1 will converge to a certain value. This means that, no matter how often you apply the formula, they will always stay in the vicinity of that value (http://en.wikipedia.org/wiki/(%CE%B5,_%CE%B4)-definition_of_limit).

So applying the code once more will not change theta0 and theta1 very much, only by a very small amount. In other words: the difference between one iterate of theta0 and the next is smaller than a certain amount.

This brings us to the following code:

double little = 1E-10;
double oldTheta0, oldTheta1;
do {
    oldTheta0 = theta0;
    oldTheta1 = theta1;
    // now calculate the new theta0 and theta1 simultaneously
    // (one possible body for this step is sketched below)
} while (Math.abs(theta0 - oldTheta0) + Math.abs(theta1 - oldTheta1) > little);
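
One possible body for the "calculate the new theta0 and theta1 simultaneously" step, assuming a squared-error cost with training arrays x and y and a learning rate alpha (none of which appear in the original answer):

    // Compute both gradients from the old theta0 and theta1 first,
    // then update, so the update is simultaneous.
    double grad0 = 0, grad1 = 0;
    for (int i = 0; i < x.length; i++) {
        double error = (theta0 + theta1 * x[i]) - y[i];
        grad0 += error;
        grad1 += error * x[i];
    }
    theta0 -= alpha * grad0 / x.length;
    theta1 -= alpha * grad1 / x.length;
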
1 vote

You need to do the following inside the while loop:

while (some condition is not met)
    // 1) Compute the gradient using theta0 and theta1
    // 2) Use the gradient to compute newTheta0 and newTheta1 values
    // 3) Set theta0 = newTheta0 and theta1 = newTheta1

You can use several different criteria for terminating the gradient descent search. For example, you can run gradient descent

  1. For a fixed number of iterations
  2. Until the gradient value at (theta0, theta1) is sufficiently close to zero (indicating a minimum)

With each iteration you should get closer and closer to the optimal solution. That is, if you compute the error (how well your theta0, theta1 model predicts your data) on each iteration, it should get smaller and smaller (a small Java helper for tracking this is sketched below).
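
One practical way to watch this is to compute the mean squared error once per iteration and verify that it shrinks. A minimal Java helper sketch (the method name and signature are illustrative):

    // Mean squared error of h(x) = theta0 + theta1 * x over the training data.
    static double meanSquaredError(double[] x, double[] y,
                                   double theta0, double theta1) {
        double sum = 0;
        for (int i = 0; i < x.length; i++) {
            double error = (theta0 + theta1 * x[i]) - y[i];
            sum += error * error;
        }
        return sum / x.length;
    }
    // Print meanSquaredError(x, y, theta0, theta1) inside the loop
    // and it should decrease from one iteration to the next.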

To learn more about how to actually write this code, you can refer to:
https://www.youtube.com/watch?v=CGHDsi_l8F4&list=PLnnr1O8OWc6ajN_fNcSUz9k5gF_E9huF0
https://www.youtube.com/watch?v=kjes46vP5m8&list=PLnnr1O8OWc6asSH0wOMn5JjgSlqNiK2C4

1 vote

Initially, assign theta[0] and theta[1] some arbitrary values, then calculate the value of your hypothesis (theta[0] + theta[1] * x1), and then use the gradient descent algorithm to calculate new values of theta[0] and theta[1]. By the algorithm:

theta[0](new) = theta[0](old) - alpha * [partial derivative of J(theta[0], theta[1]) w.r.t. theta[0]]

theta[1](new) = theta[1](old) - alpha * [partial derivative of J(theta[0], theta[1]) w.r.t. theta[1]]

where alpha is the learning rate and J(theta[0], theta[1]) is the cost function. For the squared-error cost, the partial derivatives expand as shown below.
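
Assuming the standard squared-error cost from the linked notes, J(theta[0], theta[1]) = (1/2m) * sum over i of (theta[0] + theta[1]*x[i] - y[i])^2 with m training points, these derivatives expand to:

    partial derivative w.r.t. theta[0] = (1/m) * sum over i of ( (theta[0] + theta[1]*x[i]) - y[i] )
    partial derivative w.r.t. theta[1] = (1/m) * sum over i of ( (theta[0] + theta[1]*x[i]) - y[i] ) * x[i]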

You'll get new values of theta[0] and theta[1]. You then need to calculate the new value of your hypothesis again. Repeat this process of calculating theta[0] and theta[1] until the difference between theta[i](new) and theta[i](old) is less than some small tolerance, for example 0.001.

For details refer: http://cs229.stanford.edu/notes/cs229-notes1.pdf