I am trying to run gradient descent and cannot get the same result as Octave's built-in fminunc, even when using exactly the same data.
My code is:
% Run 5000 iterations of gradient descent
for iter = 1:5000
  % Calculate the cost and the gradient at the current theta
  [cost, grad] = costFunction(initial_theta, X, y);
  % Theta = Old Theta - (Learning Rate * Gradient)
  initial_theta = initial_theta - (alpha * grad);
end
Here costFunction calculates the cost and gradient, given the examples (X, y) and the parameters (theta).
The built-in Octave function fminunc, calling the same costFunction with the same data, finds a much better answer in far fewer iterations.
Since fminunc uses the same cost function, I assume costFunction is correct.
I have tried decreasing the learning rate (in case I am hitting a local minimum) and increasing the number of iterations. The cost stops decreasing, so it seems to have found a minimum, but the final theta still has a much larger cost and is nowhere near as accurate.
Even if fminunc is using a better algorithm, shouldn't gradient descent eventually find the same answer with enough iterations and a small enough learning rate?
Or can anyone see if I am doing anything wrong?
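For comparison, here is a minimal self-contained sketch of the same update rule (written in Python/NumPy rather than Octave, with made-up data and a stand-in least-squares cost, not my actual costFunction). With a small enough alpha and enough iterations, this loop does converge to the exact minimizer, which is what I expected from my Octave code:

```python
import numpy as np

def cost_function(theta, X, y):
    # Stand-in for costFunction: least-squares cost and its gradient
    m = len(y)
    residual = X @ theta - y
    cost = (residual @ residual) / (2 * m)
    grad = X.T @ residual / m
    return cost, grad

# Toy data with an exact linear fit: y = 0 + 2*x, so the optimum is theta = [0, 2]
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

theta = np.zeros(2)   # same role as initial_theta
alpha = 0.1           # learning rate

for _ in range(5000):
    cost, grad = cost_function(theta, X, y)
    # Theta = Old Theta - (Learning Rate * Gradient)
    theta = theta - alpha * grad

print(theta)  # approaches [0, 2]
```

On this toy problem the loop reaches the true minimizer, so the update rule itself looks right to me; the difference with my real data may come down to the learning rate or the conditioning of the features.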
Thank you for any and all help.