I'm trying to understand the gradient descent algorithm for linear regression.
The question is: why do we multiply by x(1) at the end of the update for theta1, but not at the end of the update for theta0?
Thanks a lot!
The hypothesis is theta0 + theta1*x. Differentiating it with respect to theta0 and theta1 gives 1 and x respectively, so x appears in the update for theta1 but not in the update for theta0. For more details, refer to this document: cs229-notes1.
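To make the difference between the two updates concrete, here is a minimal batch gradient descent sketch in plain Python (the data, learning rate, and iteration count are made up for illustration; this is not the CS229 code):

```python
# Fit y = theta0 + theta1 * x by batch gradient descent.
# Toy data generated from y = 1 + 2x, so we expect theta0 -> 1, theta1 -> 2.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1, lr = 0.0, 0.0, 0.1

for _ in range(2000):
    grad0 = grad1 = 0.0
    for x, y in zip(xs, ys):
        err = (theta0 + theta1 * x) - y   # prediction minus target
        grad0 += err          # d/dtheta0: inner derivative is 1
        grad1 += err * x      # d/dtheta1: inner derivative is x
    theta0 -= lr * grad0 / len(xs)
    theta1 -= lr * grad1 / len(xs)

print(round(theta0, 3), round(theta1, 3))  # -> 1.0 2.0
```

Note that the only difference between the two accumulations is the extra `* x` on `grad1`, which is exactly the factor the question asks about.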
In short: because of the partial derivative and an application of the chain rule.
For Theta 0, when you take the derivative of the loss function (MSE) with respect to Theta 0 (or Beta 0 / the intercept), your derivative has the form shown on the rightmost side of eq1.
Imagine:
Y = Mx + C
M = Theta 1
C = Theta 0
Loss Function = (Y - (Mx + C))^2
The derivative has the form (outer derivative) * (inner derivative), if that makes sense. For Theta 0 the inner derivative is -1, since C appears with a minus sign inside the parentheses (watch the video to understand the derivative). So:
2(Y - (Mx + C)) * (derivative with respect to C of (Y - (Mx + C)))
= -2(Y - (Mx + C)) [the constant factor in front is usually absorbed into the learning rate]
For Theta 1, when you take the derivative of the loss function (MSE) with respect to Theta 1 (or Beta 1 / the slope), your derivative again has the form shown on the rightmost side of eq1. In this case the inner derivative is -x, because:
2(Y - (Mx + C)) * (derivative with respect to M of (Y - (Mx + C)))
= -2(Y - (Mx + C)) * x [the only term involving M is Mx, whose derivative with respect to M is x, and the subtraction contributes the minus sign]
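You can sanity-check both gradients numerically with a finite-difference sketch (the point values for x, Y, M, C below are arbitrary, chosen just to confirm the extra factor of x):

```python
# Verify d/dC and d/dM of the squared loss (Y - (M*x + C))^2
# against central finite differences at an arbitrary point.
x, Y, M, C = 3.0, 10.0, 2.0, 1.0
eps = 1e-6

def loss(m, c):
    return (Y - (m * x + c)) ** 2

# Analytic gradients from the chain rule (note the minus signs).
dC = -2 * (Y - (M * x + C))        # inner derivative wrt C is -1
dM = -2 * (Y - (M * x + C)) * x    # inner derivative wrt M is -x

# Central finite-difference approximations.
dC_num = (loss(M, C + eps) - loss(M, C - eps)) / (2 * eps)
dM_num = (loss(M + eps, C) - loss(M - eps, C)) / (2 * eps)

print(dC, dC_num)  # both approximately -6.0
print(dM, dM_num)  # both approximately -18.0
```

The gradient with respect to M is exactly the gradient with respect to C times x, which is the whole answer to the original question.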
Here is a video that can help: https://www.youtube.com/watch?v=sDv4f4s2SB8