import numpy as np

def computeCost(X, y, theta):
    # Mean squared error cost: (1 / 2m) * sum((X * theta.T - y)^2)
    inner = np.power(((X * theta.T) - y), 2)
    return np.sum(inner) / (2 * len(X))

def gradientDescent(X, y, theta, alpha, iters):
    temp = np.matrix(np.zeros(theta.shape))  # scratch space for the updated weights
    params = int(theta.ravel().shape[1])     # number of weights (theta flattened to a row)
    cost = np.zeros(iters)                   # cost recorded at each iteration

    for i in range(iters):
        err = (X * theta.T) - y              # prediction error for every training example

        for j in range(params):
            # partial derivative of the cost with respect to theta[0, j]
            term = np.multiply(err, X[:, j])
            temp[0, j] = theta[0, j] - ((alpha / len(X)) * np.sum(term))

        theta = temp                         # apply all the updates simultaneously
        cost[i] = computeCost(X, y, theta)

    return theta, cost

Here is the code for the linear regression cost function and gradient descent that I found in a tutorial, but I am not quite sure how it works.

I understand how the computeCost code works, since it is just (1/2M) times the sum of squared errors, where M is the number of training examples.

For the gradientDescent code, though, I just don't understand how it works in general. I know the formula for updating theta is something like

theta = theta - (learningRate) * (derivative of the cost function J)

but I am not sure where the (alpha / len(X)) * np.sum(term) comes from on the line updating temp[0, j].

Please help me understand!


2 Answers

1 vote

I'll break this down for you. In your gradientDescent function, you take in the predictor variables (X), the target variable (y), the weight matrix (theta), and two more parameters (alpha, iters), which are the training parameters. The function's job is to figure out how much each column of the predictor variables (X) should be multiplied by, before summing, to produce the predicted values of the target variable (y). In the first line of the function you initialize a weight matrix called temp to zeros. This is the starting point for the final weight matrix (theta) that the function will output in the end. The params variable is simply the number of weights, i.e. the number of predictor variables. That line makes more sense in the context of neural networks, where you create abstractions of the features.
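To make the shapes concrete, here is a tiny sketch (with made-up sizes, not from the original tutorial) showing what the temp and params initialization produces when theta is a 1 x 3 row matrix:

```python
import numpy as np

# theta is a 1 x n row matrix, so theta.ravel().shape[1] is just
# the number of weights (one per feature)
theta = np.matrix(np.zeros((1, 3)))
temp = np.matrix(np.zeros(theta.shape))  # same shape as theta, all zeros
params = int(theta.ravel().shape[1])

print(params)      # 3
print(temp.shape)  # (1, 3)
```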

In linear regression, the weight matrix will usually just be a one-dimensional array. In a typical feed-forward neural network we take in, let's say, 5 features and convert them to maybe 4 or 6 or n features using a weight matrix, and so on. In linear regression we simply distill all the input features into one output feature. So theta will essentially be a row vector (which is evident from the line temp[0, j] = ...) and params will be the number of features. The cost array just stores the cost at each iteration. Now on to the two loops. The outer for loop runs iters times; in a neural-network context we generally call these epochs, and it is basically the number of times you want to train on the data. In the line err = (X * theta.T) - y inside that loop, we calculate the prediction error for each training example; the shape of err will be (number_of_examples, 1). The inner for loop is where the model is actually updated: it gradually adjusts each weight, storing the results in the temp matrix.
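A tiny runnable sketch (with made-up numbers, assuming the bias column of ones has already been prepended to X) showing the shape of err:

```python
import numpy as np

# 3 training examples, 2 features (first column is the bias column of ones)
X = np.matrix([[1.0, 1.0],
               [1.0, 2.0],
               [1.0, 3.0]])
y = np.matrix([[2.0], [3.0], [4.0]])
theta = np.matrix([[0.0, 0.0]])  # row vector: one weight per feature

err = (X * theta.T) - y          # matrix product, then elementwise subtraction
print(err.shape)                 # (3, 1): one prediction error per example
```

With theta all zeros, every prediction is 0, so err is simply -y.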

Now the line term = np.multiply(err, X[:,j]): here we are calculating the individual adjustment that should be made to each weight in the temp matrix. We define the cost as sum((y_predicted - y_actual)**2) / (2 * number_of_training_points), where y_predicted = X_1*w_1 + X_2*w_2 + X_3*w_3 + ... If we differentiate this cost with respect to a particular weight (say W_i), we get sum((y_predicted - y_actual) * X_i) / number_of_training_points, where X_i is the column that W_i multiplies. The term = line computes the elementwise product inside that sum. We could multiply term by the learning rate and subtract it from W_i, but as you might notice, term is still an array. That is taken care of in the next line: we take the average of term (by summing it up and then dividing by len(X)), scale it by alpha, and subtract the result from the corresponding weight in the temp matrix. Once the weights have been updated and stored in the temp matrix, we replace the original theta with temp. We repeat this process iters times.
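Here is a small check (made-up numbers, not from the tutorial) showing that the per-weight loop above computes the same gradient as the single vectorized expression X.T * err / m:

```python
import numpy as np

X = np.matrix([[1.0, 1.0],
               [1.0, 2.0],
               [1.0, 3.0]])
y = np.matrix([[2.0], [3.0], [4.0]])
theta = np.matrix([[0.5, 0.5]])
m = len(X)

err = (X * theta.T) - y

# Loop form, as in the answer above
loop_grad = np.zeros(2)
for j in range(2):
    term = np.multiply(err, X[:, j])
    loop_grad[j] = np.sum(term) / m

# Vectorized form: one matrix product does all the weights at once
vec_grad = np.asarray((X.T * err) / m).ravel()

print(np.allclose(loop_grad, vec_grad))  # True
```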

0 votes

If, instead of writing it as ((alpha / len(X)) * np.sum(term)), you write it as (alpha * (np.sum(term) / len(X))), which you are allowed to do since multiplication is commutative and associative, then you are just multiplying alpha by the average of term, since term has len(X) entries anyway.

This means that you subtract the learning rate (alpha) times the average error, where the error is something like X[j] * theta (actual) - y (ideal). Incidentally, that is also exactly the derivative of the (X * theta - y)^2 cost: the factor of 2 from differentiating cancels the 1/2 in the cost function.
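A quick numeric check (with made-up values for term) that the two ways of writing the update are the same:

```python
import numpy as np

alpha = 0.01
term = np.array([0.5, -1.0, 2.0, 0.25])  # pretend per-example gradient terms

a = (alpha / len(term)) * np.sum(term)   # form used in the tutorial code
b = alpha * np.mean(term)                # alpha times the average of term

print(np.isclose(a, b))  # True
```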