I have attempted to implement the iterative version of the gradient descent algorithm, but it does not work correctly. The vectorized implementation of the same algorithm works fine.
Here is the iterative implementation:
function [theta] = gradientDescent_i(X, y, theta, alpha, iterations)
    % get the number of rows and columns
    nrows = size(X, 1);
    ncols = size(X, 2);
    % initialize the hypothesis vector
    h = zeros(nrows, 1);
    % initialize the temporary theta vector
    theta_temp = zeros(ncols, 1);
    % run gradient descent for the specified number of iterations
    count = 1;
    while count <= iterations
        % calculate the hypothesis values and fill into the vector
        for i = 1 : nrows
            for j = 1 : ncols
                term = theta(j) * X(i, j);
                h(i) = h(i) + term;
            end
        end
        % calculate the gradient
        for j = 1 : ncols
            for i = 1 : nrows
                term = (h(i) - y(i)) * X(i, j);
                theta_temp(j) = theta_temp(j) + term;
            end
        end
        % update the gradient with the factor
        fact = alpha / nrows;
        for i = 1 : ncols
            theta_temp(i) = fact * theta_temp(i);
        end
        % update the theta
        for i = 1 : ncols
            theta(i) = theta(i) - theta_temp(i);
        end
        % update the count
        count += 1;
    end
end
And below is the vectorized implementation of the same algorithm:
function [theta, theta_all, J_cost] = gradientDescent(X, y, theta, alpha)
    % set the learning rate
    learn_rate = alpha;
    % set the number of iterations
    n = 1500;
    % number of training examples
    m = length(y);
    % initialize the theta_new vector
    l = length(theta);
    theta_new = zeros(l, 1);
    % initialize the cost vector
    J_cost = zeros(n, 1);
    % initialize the vector to store all the calculated theta values
    theta_all = zeros(n, 2);
    % perform gradient descent for the specified number of iterations
    for i = 1 : n
        % calculate the hypothesis
        hypothesis = X * theta;
        % calculate the error
        err = hypothesis - y;
        % calculate the gradient
        grad = X' * err;
        % calculate the new theta
        theta_new = (learn_rate / m) .* grad;
        % update the old theta
        theta = theta - theta_new;
        % update the cost
        J_cost(i) = computeCost(X, y, theta);
        % store the calculated theta value
        if i < n
            index = i + 1;
            theta_all(index, :) = theta';
        end
    end
end
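The vectorized version calls computeCost, which is not listed here. A minimal sketch of the usual squared-error cost for this exercise would look like the following (an assumption about its contents, not the actual file):
function J = computeCost(X, y, theta)
    % number of training examples
    m = length(y);
    % squared-error cost for linear regression (assumed standard form)
    J = sum((X * theta - y) .^ 2) / (2 * m);
end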
A link to the dataset can be found here. The filename is ex1data1.txt.
ISSUES
For an initial theta = [0, 0] (this is a vector!), a learning rate of 0.01, and 1500 iterations, I get the optimal theta as:
- theta0 = -3.6303
- theta1 = 1.1664
The above is the output of the vectorized implementation, which I know is correct (it passed all the test cases on Coursera).
However, when I run the same algorithm with the iterative method (the first code above), with alpha = 0.01 and 1500 iterations, the theta values I get are:
- theta0 = -0.20720
- theta1 = -0.77392
This implementation fails the test cases, so I know it is incorrect.
I am, however, unable to see where I am going wrong: the iterative code does the same job and the same multiplications as the vectorized one, and when I traced one iteration of both by hand (pen and paper!) the values came out the same, yet they differ when I run the code in Octave.
Any help would be greatly appreciated, especially if you can point out where I went wrong and what exactly caused the failure.
Points to consider
- The implementation of the hypothesis is correct: I tested it and both codes gave the same results, so no issues there.
- I printed the gradient vector from both codes and realised that the error lies there, because the outputs were very different (see the comparison snippet below).
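To make that comparison concrete, here is a small check that can be run in Octave (a sketch only; it uses the fileReader function listed further below, the ex1data1.txt file, and illustrative variable names grad_vec / grad_it). It computes the gradient of a single pass both ways, with the accumulators freshly zeroed:
% one-pass gradient check (illustrative sketch, not part of the original code)
[X, y] = fileReader('ex1data1.txt');
theta = zeros(2, 1);
m = length(y);

% vectorized gradient for a single pass
grad_vec = X' * (X * theta - y);

% iterative gradient for the same pass, with freshly zeroed accumulators
h = zeros(m, 1);
grad_it = zeros(2, 1);
for i = 1 : m
    for j = 1 : 2
        h(i) = h(i) + theta(j) * X(i, j);
    end
end
for j = 1 : 2
    for i = 1 : m
        grad_it(j) = grad_it(j) + (h(i) - y(i)) * X(i, j);
    end
end

% with the accumulators reset, the two results agree up to rounding
disp(max(abs(grad_vec - grad_it)));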
Additionally, here is the code for pre-processing the data:
function [X, y] = fileReader(filename)
    % load the dataset
    dataset = load(filename);
    % get the dimensions of the dataset
    nrows = size(dataset, 1);
    ncols = size(dataset, 2);
    % generate the X matrix from the dataset
    X = dataset(:, 1 : ncols - 1);
    % generate the y vector
    y = dataset(:, ncols);
    % append 1's to the X matrix
    X = [ones(nrows, 1), X];
end
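For reference, this is roughly how the functions above are wired together when running the exercise (an assumed driver based on the signatures, not the actual script used):
% assumed driver, based on the function signatures above
[X, y] = fileReader('ex1data1.txt');
theta = zeros(2, 1);    % initial theta = [0, 0]
alpha = 0.01;           % learning rate
iters = 1500;           % number of iterations

theta_vec = gradientDescent(X, y, theta, alpha);            % vectorized version
theta_it  = gradientDescent_i(X, y, theta, alpha, iters);   % iterative version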
Comments
- Cris Luengo: "... theta_temp properly. Note that both versions are iterative, you cannot do gradient descent without iterating."
- Floris: "count += 1; in your 'non-vectorised' implementation? To the best of my knowledge, that is not valid Matlab syntax."
- beaker: "+= is valid in Octave, which is also tagged, so I assume that's what the OP actually used."
- enigma6174: "... theta_temp wrong so I will look into that and confirm. By iterative, the first code uses matrix multiplications for all calculations which is much faster as compared to the 2nd code (of course, gradient update has to be iterative)."
- Cris Luengo: "theta_temp is initialized to zeros before the first iteration, but not in between iterations. Maybe this is intentional, but it looks like it might be a bug."
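If the last comment is right, the minimal change to the iterative version would be to zero the accumulator at the top of every pass; the same reasoning applies to h, which is also only initialized once before the loop. A sketch of the adjusted loop (my reading of the hint, not a confirmed fix):
while count <= iterations
    % reset the accumulators for this pass; in the posted code they are
    % zeroed only once, before the while loop, so values carry over
    h = zeros(nrows, 1);
    theta_temp = zeros(ncols, 1);
    % ... the rest of the loop body stays the same ...
    count += 1;
end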