How exactly does this simple calculation of an ML gradient descent cost function work in Octave/MATLAB?

Question

I am following a machine learning course on Coursera and I am doing the following exercise using Octave (MATLAB should behave the same).

The exercise is about computing the cost function for a gradient descent algorithm.

The course slide gives the cost function that I have to implement in Octave:
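
(The slide image is not reproduced in this copy. Judging from the implementation below and from the later remark that the second equation uses the transposed theta, the two equations are presumably the usual ones for single-variable linear regression. This is a reconstruction under that assumption, not a copy of the slide.)

```latex
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2

h_\theta(x) = \theta^T x = \theta_0 + \theta_1 x_1
```

Here m is the number of training examples and x is assumed to be a single example written as a column vector whose first entry is 1.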

So J is a function of the THETA parameters, which are collected in the THETA vector (the one that appears in the second equation above).

This is the correct MATLAB/Octave implementation of the J(THETA) computation:

function J = computeCost(X, y, theta)
%COMPUTECOST Compute cost for linear regression
%   J = COMPUTECOST(X, y, theta) computes the cost of using theta as the
%   parameter for linear regression to fit the data points in X and y

% Initialize some useful values
m = length(y); % number of training examples

% You need to return the following variables correctly 
J = 0;

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the cost of a particular choice of theta
%               You should set J to the cost.

J = (1/(2*m))*sum(((X*theta) - y).^2);

% =========================================================================

end

where:

X is a matrix with m rows and 2 columns, with every element of the first column set to 1:

X =

1.0000    6.1101
1.0000    5.5277
1.0000    8.5186
......    ......
......    ......
......    ......

y is a column vector of m elements (one per row of X):

y =

   17.59200
    9.13020
   13.66200
   ........
   ........
   ........

Finally, theta is a 2-row column vector initialized to zeros, like this:

theta = zeros(2, 1); % initialize fitting parameters
theta
theta =

   0
   0

Ok, coming back to my working solution:

J = (1/(2*m))*sum(((X*theta) - y).^2)

Specifically, consider the multiplication of the matrix X by the vector theta. I know it is a valid matrix multiplication because the number of columns of X (2) equals the number of rows of theta (2).
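
The dimension argument can be checked directly. This is a minimal sketch that uses only the three sample rows of X and y listed above (the remaining rows are elided in the question), so the sizes shown apply to this reduced data set only:

```octave
% Illustrative data: only the first three rows of X and y from the question
X = [1 6.1101; 1 5.5277; 1 8.5186];
y = [17.592; 9.1302; 13.662];
theta = zeros(2, 1);   % 2 rows, 1 column

size(X)        % 3 rows, 2 columns
size(theta)    % 2 rows, 1 column
h = X * theta; % valid: columns of X (2) match rows of theta (2)
size(h)        % 3 rows, 1 column, the same shape as y
```

Because h has the same shape as y, the element-wise subtraction (X*theta) - y is well defined.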

My doubt, which is driving me crazy (it is probably trivial), concerns the course slide shown earlier:

As you can see, the second equation, which computes the h_theta(x) value, uses the transposed theta vector, whereas the code uses theta without transposing it.

Why ?!?!

I suspect that it depends only on how the theta vector was created. It was built this way:

theta = zeros(2, 1); % initialize fitting parameters

which generates a vector with 2 rows and 1 column instead of the classic vector with 1 row and 2 columns. So maybe I do not have to transpose it, but I am not at all sure about this.

Is my intuition correct or what am I missing?

Tasos Papastylianou · Accepted Answer · 2019-04-13T20:43:32

Your intuition is correct. In the slide, x is a single example written as a vertical vector, so theta.' * x is a scalar; your X stores one example per row, so X * theta computes that same scalar for every row at once. Effectively it does not matter whether you perform the multiplication as theta.' * X.' or as X * theta, since this generates either a horizontal vector or a vertical vector of the hypothesis representing all observations. What you're expected to do next is subtract the y vector from the hypothesis vector at each observation, square the differences, and sum the results. So as long as y has the same orientation as your hypothesis and you subtract at each equivalent point, the scalar end result of the summation will be the same.

Often enough, you'll see the X * theta version preferred over theta.' * X.' purely for convenience, to avoid transposing over and over again just to be consistent with the mathematical notation. But this is fine, since the underlying math doesn't really change, only the order of equivalent operations.
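
The equivalence of the two orientations can be checked numerically. The sketch below assumes the three sample rows from the question and an arbitrary non-zero theta picked only for illustration (with the all-zero theta from the exercise, both hypotheses would trivially be zero):

```octave
% Illustrative data: first three rows from the question, arbitrary theta
X = [1 6.1101; 1 5.5277; 1 8.5186];
y = [17.592; 9.1302; 13.662];
theta = [0.5; 1.5];
m = length(y);

h_col = X * theta;       % m-by-1 vertical vector of hypotheses
h_row = theta.' * X.';   % 1-by-m horizontal vector, same values

% y must match the orientation of the hypothesis it is subtracted from
J_col = (1/(2*m)) * sum((h_col - y).^2);
J_row = (1/(2*m)) * sum((h_row - y.').^2);

disp(abs(J_col - J_row) < 1e-12)   % displays 1 when the two costs agree
```

Note that y is transposed in the second cost so that it lines up with the horizontal hypothesis. Subtracting a vertical y from a horizontal h_row would not raise an error in Octave; it would broadcast to an m-by-m matrix and give a different, wrong result.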

I agree it's confusing though, both because it makes it harder to follow the formula when the code effectively looks like it's doing something else, and also because it messes with the usual convention that a vertical vector represents 'coordinates' and a horizontal vector represents observations. In such cases, especially in languages like MATLAB / Octave where the orientation of a vector isn't explicitly defined in the variable's type, it is doubly important to document what you expect the inputs to represent, and preferably there should have been assert statements in the code confirming the input has been passed in the correct orientation. Clearly here they felt it wasn't necessary because this code runs under controlled conditions in a predefined exercise environment anyway, but it would have been good practice from a software engineering point of view.
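
As a rough illustration of what such checks could look like, here is a hypothetical variant of the exercise function. The name computeCostChecked and the error messages are made up for this sketch and are not part of the course material:

```octave
function J = computeCostChecked(X, y, theta)
  % Hypothetical variant of computeCost that validates input orientation.
  %   X     : m-by-n matrix, one observation per row
  %   y     : m-by-1 vertical vector of targets
  %   theta : n-by-1 vertical vector of parameters
  m = length(y);
  assert(iscolumn(y), 'y must be an m-by-1 vertical vector');
  assert(iscolumn(theta), 'theta must be an n-by-1 vertical vector');
  assert(size(X, 1) == m, 'X must have one row per element of y');
  assert(size(X, 2) == numel(theta), 'X must have one column per element of theta');

  J = (1/(2*m)) * sum((X*theta - y).^2);
end
```

With these checks, passing y as a horizontal vector stops with a clear message. Without them, X*theta - y would be broadcast into an m-by-m matrix in Octave and the function would silently return a horizontal vector instead of a scalar cost.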
