2 votes

I am trying to implement linear regression in Java. My hypothesis is theta0 + theta1 * x[i]. I am trying to figure out the values of theta0 and theta1 so that the cost function is minimized. I am using gradient descent to find them -

In the gradient descent algorithm:

while(repeat until convergence)
{
   calculate theta0 and theta1 simultaneously.
}

What is this "repeat until convergence"? I understand that it refers to reaching the local minimum, but what exact code should I put in the while loop?

I am very new to machine learning and have just begun coding basic algorithms to get a better understanding. Any help will be greatly appreciated.

"How do I do gradient descent" is too broad as a question. But, you don't solve a linear regression problem with an iterative algorithm; it's a convex problem with a direct solution.Sean Owen
I cannot disagree more. The fact that the problem has a direct solution does not imply that you should not use iterative solutions. There are dozens of reasons to do so, starting with efficiency (as direct methods like OLS require O(n^2) memory and time). There have been nice discussions on Cross Validated about why we do not use direct solutions of optimization problems when they exist. – lejlot

5 Answers

15 votes

Gradient descent is an iterative approach for minimizing a given function. We start with an initial guess of the solution, take the gradient of the function at that point, step the solution in the negative direction of the gradient, and repeat the process. The algorithm will eventually converge where the gradient is zero (which corresponds to a local minimum). So your job is to find the values of theta0 and theta1 that minimize the loss function (for example, least squared error). "Convergence" means you have reached the local minimum and further iterations do not affect the values of the parameters, i.e. theta0 and theta1 remain constant.

Let's see an example. Note: assume the function lies in the first quadrant for this explanation.

[Figure: a convex curve f(x) with starting point A, intermediate points B and C descending toward the minimum at D, and points F and G on the opposite slope]

Let's say you have to minimize a function f(x) (the cost function, in your case). For this you need to find the value of x that minimizes f(x). Here is the step-by-step procedure to find that value of x using the gradient descent method (a minimal 1-D Java sketch follows the list):

  1. Choose an initial value of x. Let's say it is point A in the figure.
  2. Calculate the gradient of f(x) with respect to x at A.
  3. This gives the slope of the function at point A. Since the function is increasing at A, it yields a positive value.
  4. Subtract this positive value from the initial guess of x and update x, i.e. x = x - [some positive value]. This brings x closer to D (the minimum) and reduces the value of f(x) (see the figure). Let's say after iteration 1 you reach point B.
  5. At point B, repeat the same process as in step 4 and reach point C, and finally point D.
  6. At point D, since it is a local minimum, the gradient you calculate is 0 (or very close to 0). Now when you try to update x, i.e. x = x - [0], you get the same x (or a value very close to the previous x). This condition is known as "convergence". The steps above were for an increasing slope, but they are equally valid for a decreasing slope. For example, the gradient at point G is some negative value. When you update x, i.e. x = x - [negative value] = x + [some positive value], x increases and moves closer to point F (i.e. closer to the minimum).
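
To make these steps concrete, here is a minimal 1-D Java sketch of the procedure, using f(x) = (x - 3)^2 as an illustrative convex function (the function, starting point, learning rate, and tolerance are assumptions for the example, not from the question):

    // 1-D gradient descent sketch; f(x) = (x - 3)^2 has its minimum at x = 3.
    double x = 10.0;          // initial guess (think of point A in the figure)
    double alpha = 0.1;       // learning rate (assumed)
    double tolerance = 1e-9;  // convergence threshold (assumed)
    while (true) {
        double gradient = 2 * (x - 3);       // derivative of (x - 3)^2
        double newX = x - alpha * gradient;  // step in the negative gradient direction
        if (Math.abs(newX - x) < tolerance) {
            break;  // x barely changes any more: this is "convergence"
        }
        x = newX;
    }
    // x is now very close to 3, the minimizer of f.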

There are various ways to run this gradient descent loop. As @mattnedrich said, the two basic approaches are

  1. Use a fixed number of iterations N; the pseudocode is

    iter = 0
    while (iter < N) {
      compute gradient0 and gradient1 at the current (theta0, theta1)
      theta0 = theta0 - alpha * gradient0
      theta1 = theta1 - alpha * gradient1
      iter++
    }
    
  2. Repeat until two consecutive values of theta0 and theta1 are almost identical. The pseudocode for this is given by @Gerwin in another answer. (A Java sketch of approach 1 follows this list.)
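
As a rough illustration of approach 1 applied to the asker's hypothesis theta0 + theta1 * x[i], here is a minimal Java sketch; the data arrays, learning rate alpha, and iteration count N are placeholder assumptions:

    // Gradient descent for h(x) = theta0 + theta1 * x with squared-error cost.
    double[] x = {1, 2, 3, 4};  // placeholder training inputs
    double[] y = {2, 4, 6, 8};  // placeholder training outputs
    int m = x.length;
    double theta0 = 0, theta1 = 0;
    double alpha = 0.01;        // learning rate (assumed)
    int N = 10000;              // fixed number of iterations (assumed)
    for (int iter = 0; iter < N; iter++) {
        double grad0 = 0, grad1 = 0;
        for (int i = 0; i < m; i++) {
            double error = (theta0 + theta1 * x[i]) - y[i];
            grad0 += error;          // contribution to d(cost)/d(theta0)
            grad1 += error * x[i];   // contribution to d(cost)/d(theta1)
        }
        // simultaneous update: both gradients were computed from the old thetas
        theta0 -= alpha * grad0 / m;
        theta1 -= alpha * grad1 / m;
    }
    // theta0 and theta1 now approximate the least-squares fit (here, roughly 0 and 2).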

Gradient descent is one approach to minimizing the function in linear regression; a direct solution exists too. Batch processing (also called the normal equation) can be used to find the values of theta0 and theta1 in a single step. If X is the input matrix, y is the output vector, and theta is the parameter vector you want to calculate, then for the squared-error approach you can find theta in a single step using this matrix equation:

theta = inverse(transpose(X) * X) * transpose(X) * y

But as this involves matrix computation, it is obviously a more computationally expensive operation than gradient descent when the matrix X is large. (For the two-parameter case in the question, a closed-form Java sketch follows.) I hope this answers your query; if not, let me know.
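
For the two-parameter hypothesis in the question, the matrix equation above reduces to the familiar closed-form formulas for simple linear regression, so no matrix library is needed. A minimal Java sketch, assuming training arrays x and y are defined elsewhere:

    // Closed-form least squares for h(x) = theta0 + theta1 * x;
    // equivalent to the normal equation for this two-parameter model.
    double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
    int m = x.length;
    for (int i = 0; i < m; i++) {
        sumX  += x[i];
        sumY  += y[i];
        sumXY += x[i] * y[i];
        sumXX += x[i] * x[i];
    }
    double theta1 = (m * sumXY - sumX * sumY) / (m * sumXX - sumX * sumX);
    double theta0 = (sumY - theta1 * sumX) / m;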

4 votes

Gradient descent is an optimization algorithm (for minimization, to be exact; there is gradient ascent for maximization too). In the case of linear regression, we minimize the cost function. Gradient descent belongs to the family of gradient-based optimization methods, and its idea is that repeatedly stepping in the direction of the negative gradient takes the parameters down the hill of the cost surface toward the optimum.

In your algorithm, "repeat till convergence" means: repeat until you reach the optimal point on the cost surface/curve, which is detected when the gradient is very, very close to zero for some iterations. In that case the algorithm is said to have converged (possibly at a local optimum; gradient descent converges to a local optimum in many cases).

To determine whether your algorithm has converged, you can do the following (a Java rendering of this check is sketched after the pseudocode):

theta = initial guess
while (True):
    calculate gradient at theta
    newTheta = theta - alpha * gradient
    if gradient is very close to zero and abs(newTheta - theta) is very close to zero:
        break from loop  # the algorithm has converged
    theta = newTheta
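
Here is a hedged Java version of the same check for the two-parameter model in the question. The helper methods costGradient0 and costGradient1 are hypothetical; you would implement them to return the partial derivatives of your cost function over your data:

    // costGradient0/costGradient1 are hypothetical helpers returning the
    // partial derivatives of the cost at the current (theta0, theta1).
    double theta0 = 0, theta1 = 0;
    double alpha = 0.01;    // learning rate (assumed)
    double epsilon = 1e-8;  // "very close to zero" threshold (assumed)
    while (true) {
        double g0 = costGradient0(theta0, theta1);
        double g1 = costGradient1(theta0, theta1);
        double newTheta0 = theta0 - alpha * g0;
        double newTheta1 = theta1 - alpha * g1;
        boolean gradientNearZero = Math.abs(g0) < epsilon && Math.abs(g1) < epsilon;
        boolean thetasStable = Math.abs(newTheta0 - theta0) + Math.abs(newTheta1 - theta1) < epsilon;
        if (gradientNearZero && thetasStable) {
            break;  // the algorithm has converged
        }
        theta0 = newTheta0;
        theta1 = newTheta1;
    }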

For details on linear regression, gradient descent, and other optimizations, you can follow Andrew Ng's notes: http://cs229.stanford.edu/notes/cs229-notes1.pdf

1 vote

I do not know very much about gradient descent, but we learned another way to calculate linear regression from a number of points:

http://en.wikipedia.org/wiki/Simple_linear_regression#Fitting_the_regression_line

But if you really want to add the while loop, I recommend the following:

Eventually, theta0 and theta1 will converge to a certain value. This means that, no matter how often you apply the formula, they will always stay in the vicinity of that value (http://en.wikipedia.org/wiki/(%CE%B5,_%CE%B4)-definition_of_limit).

So applying the code once more will not change theta0 and theta1 very much, only by a very small amount. In other words: the difference between one iterate of theta0 and the next is smaller than a certain amount.

This brings us to the following code:

double little = 1E-10;
double oldTheta0, oldTheta1;
do {
    oldTheta0 = theta0;
    oldTheta1 = theta1;
    // now calculate the new theta0 and theta1 simultaneously
    // (one possible body for this step is sketched below)
} while (Math.abs(theta0 - oldTheta0) + Math.abs(theta1 - oldTheta1) > little);
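
One possible body for the "calculate the new theta0 and theta1 simultaneously" step, assuming a squared-error cost with training arrays x and y and a learning rate alpha (none of which appear in the original answer):

    // Compute both gradients from the old theta0 and theta1 first,
    // then update, so the update is simultaneous.
    double grad0 = 0, grad1 = 0;
    for (int i = 0; i < x.length; i++) {
        double error = (theta0 + theta1 * x[i]) - y[i];
        grad0 += error;
        grad1 += error * x[i];
    }
    theta0 -= alpha * grad0 / x.length;
    theta1 -= alpha * grad1 / x.length;
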
1 vote

You need to do the following inside the while loop:

while (some condition is not met)
    // 1) Compute the gradient using theta0 and theta1
    // 2) Use the gradient to compute newTheta0 and newTheta1 values
    // 3) Set theta0 = newTheta0 and theta1 = newTheta1

You can use several different criteria for terminating the gradient descent search. For example, you can run gradient descent

  1. For a fixed number of iterations
  2. Until the gradient value at (theta0, theta1) is sufficiently close to zero (indicating a minimum)

With each iteration you should get closer and closer to the optimal solution. That is, if you compute the error (how well your theta0, theta1 model predicts your data) on each iteration, it should get smaller and smaller (a small Java helper for tracking this is sketched below).
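
One practical way to watch this is to compute the mean squared error once per iteration and verify that it shrinks. A minimal Java helper sketch (the method name and signature are illustrative):

    // Mean squared error of h(x) = theta0 + theta1 * x over the training data.
    static double meanSquaredError(double[] x, double[] y,
                                   double theta0, double theta1) {
        double sum = 0;
        for (int i = 0; i < x.length; i++) {
            double error = (theta0 + theta1 * x[i]) - y[i];
            sum += error * error;
        }
        return sum / x.length;
    }
    // Print meanSquaredError(x, y, theta0, theta1) inside the loop
    // and it should decrease from one iteration to the next.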

To learn more about how to actually write this code, you can refer to:
https://www.youtube.com/watch?v=CGHDsi_l8F4&list=PLnnr1O8OWc6ajN_fNcSUz9k5gF_E9huF0
https://www.youtube.com/watch?v=kjes46vP5m8&list=PLnnr1O8OWc6asSH0wOMn5JjgSlqNiK2C4

1 vote

Initially, assign theta[0] and theta[1] some arbitrary values, then calculate the value of your hypothesis (theta[0] + theta[1] * x1), and then use the gradient descent algorithm to calculate new values of theta[0] and theta[1]. By the algorithm:

theta[0](new) = theta[0](old) - alpha * [partial derivative of J(theta[0], theta[1]) w.r.t. theta[0]]

theta[1](new) = theta[1](old) - alpha * [partial derivative of J(theta[0], theta[1]) w.r.t. theta[1]]

where alpha is the learning rate and J(theta[0], theta[1]) is the cost function. For the squared-error cost, the partial derivatives expand as shown below.
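
Assuming the standard squared-error cost from the linked notes, J(theta[0], theta[1]) = (1/2m) * sum over i of (theta[0] + theta[1]*x[i] - y[i])^2 with m training points, these derivatives expand to:

    partial derivative w.r.t. theta[0] = (1/m) * sum over i of ( (theta[0] + theta[1]*x[i]) - y[i] )
    partial derivative w.r.t. theta[1] = (1/m) * sum over i of ( (theta[0] + theta[1]*x[i]) - y[i] ) * x[i]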

You'll get new values of theta[0] and theta[1]. You then need to calculate the new value of your hypothesis again. Repeat this process of calculating theta[0] and theta[1] until the difference between theta[i](new) and theta[i](old) is less than some small tolerance, for example 0.001.

For details refer: http://cs229.stanford.edu/notes/cs229-notes1.pdf