Mathematically, we are trying to minimise the error function
Error(θ) = Σ_i (y_i - h(x_i))^2, summed over the training examples i.
To minimise the error, we set every partial derivative to zero:
d(Error(θ))/dθ_j = 0 for each parameter θ_j,
substituting h(x_i) = Σ_j (θ_j * x_ij), summed over the features j,
and derive the above formula.
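As a quick sanity check of the zero-derivative condition, here is a minimal sketch in plain Python. The helper names (`predict`, `error`, `gradient`) and the toy data are made up for illustration, and each x_i is assumed to carry a leading 1 so that θ_0 acts as the intercept. Differentiating the error term by term gives d(Error)/dθ_j = -2 * Σ_i (y_i - h(x_i)) * x_ij, which is what `gradient` computes.

```python
def predict(theta, x):
    # h(x) = sum over j of theta_j * x_j
    return sum(t * v for t, v in zip(theta, x))


def error(theta, xs, ys):
    # Error(theta) = sum over i of (y_i - h(x_i))^2
    return sum((y - predict(theta, x)) ** 2 for x, y in zip(xs, ys))


def gradient(theta, xs, ys):
    # d(Error)/d(theta_j) = -2 * sum over i of (y_i - h(x_i)) * x_ij
    return [
        -2 * sum((y - predict(theta, x)) * x[j] for x, y in zip(xs, ys))
        for j in range(len(theta))
    ]


# Toy data generated from y = 1 + 2*x; the first column is the constant 1.
xs = [[1, 0], [1, 1], [1, 2]]
ys = [1, 3, 5]

print(gradient([1, 2], xs, ys))  # every component is zero at the minimiser
print(gradient([0, 0], xs, ys))  # non-zero away from the minimiser
```

At θ = (1, 2) the error is zero and so is every partial derivative; at any other θ at least one component is non-zero, which is the signal gradient descent follows.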
The rest of the formulation can be reasoned as follows.
Gradient descent uses the slope of the function itself to find the minimum. Think of it as walking downhill into a valley, always heading in the direction of steepest descent. So we have the direction, but what should the step size be (how far should we keep moving in the same direction)?
For that we also use the slope, since the slope is zero at the minimum. (Think of the bottom of a valley: all nearby points are higher. On one side the height is decreasing and the slope is negative; on the other side the height is increasing and the slope is positive. The slope changes sign from negative to positive, and the minimum is the point of zero slope in between.) Because the slope has to reach zero, its magnitude shrinks as we approach the minimum. So if the magnitude of the slope is large we can take large steps, and if it is small we are closing in on the minimum and should take small steps.
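The idea of letting the slope set both direction and step size can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the text above: a one-dimensional "valley" f(x) = (x - 3)^2, a fixed multiplier `learning_rate`, and the hypothetical names `slope` and `descend`.

```python
def slope(x):
    # Derivative of the example valley f(x) = (x - 3)^2
    return 2 * (x - 3)


def descend(x, learning_rate=0.1, iterations=100):
    # Each step moves against the slope; its length is proportional
    # to the magnitude of the slope at the current point.
    step_sizes = []
    for _ in range(iterations):
        step = learning_rate * slope(x)
        x = x - step
        step_sizes.append(abs(step))
    return x, step_sizes


x_min, step_sizes = descend(x=0.0)
print(x_min)           # close to 3, the bottom of the valley
print(step_sizes[:3])  # early steps are comparatively large
print(step_sizes[-3:]) # late steps are tiny as the slope approaches zero
```

The steps shrink on their own as the slope flattens, so no separate rule is needed to slow down near the minimum. The multiplier still matters: if it is too large for the function at hand, the steps overshoot the bottom instead of settling into it.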