4
votes

I am currently taking Machine Learning on the Coursera platform and I am trying to implement Logistic Regression. To implement Logistic Regression, I am using gradient descent to minimize the cost function and I am to write a function called costFunctionReg.m that returns both the cost and the gradient of each parameter evaluated at the current set of parameters.

The problem is better described below:

Problem Description

Problem Description Continued

My cost function is working, but the gradient function is not. Please note that I would prefer to implement this using looping, rather than element-by-element operations.

I am computing theta[0] (in MATLAB, theta(1)) separately as it is not being regularized, i.e. we do not use the first term (with lambda).

function [J, grad] = costFunctionReg(theta, X, y, lambda)
    %COSTFUNCTIONREG Compute cost and gradient for logistic regression with regularization
    %   J = COSTFUNCTIONREG(theta, X, y, lambda) computes the cost of using
    %   theta as the parameter for regularized logistic regression and the
    %   gradient of the cost w.r.t. to the parameters. 

    % Initialize some useful values
    m = length(y); % number of training examples
    n = length(theta); %number of parameters (features)

    % You need to return the following variables correctly 
    J = 0;
    grad = zeros(size(theta));

    % ====================== YOUR CODE HERE ======================
    % Instructions: Compute the cost of a particular choice of theta.
    %               You should set J to the cost.
    %               Compute the partial derivatives and set grad to the partial
    %               derivatives of the cost w.r.t. each parameter in theta

    % ----------------------1. Compute the cost-------------------
    %hypothesis
    h = sigmoid(X * theta);

    for i = 1 : m
        % The cost for the ith term before regularization
        J = J - ( y(i) * log(h(i)) )   -  ( (1 - y(i)) * log(1 - h(i)) );

        % Adding regularization term
        for j = 2 : n
            J = J + (lambda / (2*m) ) * ( theta(j) )^2;
        end            
    end
    J = J/m; 

    % ----------------------2. Compute the gradients-------------------

    %not regularizing theta[0] i.e. theta(1) in matlab

    j = 1;

    for i = 1 : m
        grad(j) = grad(j) + ( h(i) - y(i) ) * X(i,j);
    end

    for j = 2 : n    
        for i = 1 : m
            grad(j) = grad(j) + ( h(i) - y(i) ) * X(i,j) + lambda * theta(j);
        end    
    end

    grad = (1/m) * grad;

    % =============================================================
end

What am I doing wrong?

1
Pls help edit the code block formatting...new user and don't know how to :( - Malixa
Coursera machine learning exercise eh? Say hi to Andrew Ng. - Sembei Norimaki
I took the course long time ago. But I think you have to add a column of 0 to the left of X which represents the offset term that you don't have to regularize. Now you are applying the non-regularization term to the first feature (first column), not to the offset term. Try doing X=[zeros(size(X,1),1) X] - Sembei Norimaki
@SembeiNorimaki That isn't correct. A column of ones needs to be added to account for the bias term but you are correct where the bias term is not regularized. However, this function is tested with the proper inputs when submitting this function for marks so the addition of the extra column is not required. It is already done for you when you run the tester code. - rayryeng
True, it must be a column of ones since it's multiplied by the offset value instead of added. I remember having to add this extra column of ones, but maybe this is done in another part of the code. - Sembei Norimaki

1 Answers

2
votes

The way you are applying regularization is incorrect. You add regularization after you sum over all training examples but instead you are adding regularization after each example. If you left your code as it was before the correction, you are inadvertently making the gradient step larger and will eventually overshoot the solution. This overshooting will accumulate and will inevitably give you a gradient vector of Inf or -Inf for all components (except for the bias term).

Simply put, place your lambda*theta(j) statement after the second for loop terminates:

for j = 2 : n    
    for i = 1 : m
        grad(j) = grad(j) + ( h(i) - y(i) ) * X(i,j); % Change
    end
    grad(j) = grad(j) + lambda * theta(j); % Change
end