
Gradient Descent Algorithm for Multivariate Linear Regression

OK, so what exactly does this algorithm mean?

What I know :

i) alpha: how big each gradient descent step will be.

ii) Now, ∑{ hTheta[x(i)] - y(i) }: this is the total error for the given values of Theta.

The error is the difference between the predicted value hTheta[x(i)] and the actual value y(i).

∑{ hTheta[x(i)] - y(i) } gives us the summation of all errors from all training examples.

What does x_j^(i) at the end stand for?

Are we doing the following while implementing Gradient Descent for multiple variable Linear Regression?

Theta (j) minus:

  • alpha

  • times 1/m

  • times:

{ error of first training example multiplied by jth element of first training example. PLUS

error of second training example multiplied by jth element of second training example. PLUS

.

.

.

PLUS error of nth training example multiplied by jth element of nth training example. }
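
In code, my understanding of this update for a single Theta(j) is roughly the NumPy sketch below (X, y, theta, alpha and update_theta_j are just my own placeholder names):

```python
import numpy as np

# X: (m, n) matrix of training examples, y: (m,) targets, theta: (n,) current parameters
def update_theta_j(X, y, theta, alpha, j):
    """One gradient descent update for the single parameter theta_j."""
    m = len(y)                         # number of training examples
    total = 0.0
    for i in range(m):                 # loop over every training example
        prediction = X[i] @ theta      # hTheta[x(i)]
        error = prediction - y[i]      # hTheta[x(i)] - y(i)
        total += error * X[i, j]       # ... multiplied by the jth element of x(i)
    return theta[j] - alpha * (1.0 / m) * total
```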

n is the number of parameters that you want to estimate and m is the number of observations (training instances) in the data. The term inside the summation is for a single training example; you sum it over all training examples and repeat the update until convergence. j is the index of the parameter. x^(i) is the input feature vector of the ith training example, and x_j^(i) is feature j of the ith training example. – ARAT

1 Answer


Gradient Descent is an iterative algorithm for finding the minimum of a function. When given a convex function, it is guaranteed to find the global minimum, provided alpha is small enough. Here is the gradient descent algorithm for finding the minimum of a function J:

$$
\text{repeat until convergence:} \qquad \theta := \theta - \alpha \, \nabla_{\theta} J(\theta)
$$

The idea is to move the parameters in the opposite direction of the gradient, scaled by the learning rate alpha. Eventually this walks down to the minimum of the function.
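
As a rough sketch of that loop in NumPy (the quadratic example function and the names `gradient_descent`, `grad`, `theta0`, `tol` are illustrative, not part of the original formula):

```python
import numpy as np

def gradient_descent(grad, theta0, alpha=0.1, tol=1e-8, max_iters=10000):
    """Repeatedly step opposite to the gradient until the updates become negligible."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iters):
        step = alpha * grad(theta)        # gradient scaled by the learning rate
        theta = theta - step              # move in the opposite direction of the gradient
        if np.linalg.norm(step) < tol:    # converged: updates are negligibly small
            break
    return theta

# Example: J(theta) = (theta - 3)^2 is convex, with gradient 2 * (theta - 3)
print(gradient_descent(lambda t: 2 * (t - 3), theta0=[0.0]))  # approx [3.]
```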

We can rewrite this parameter update for each axis of theta:

$$
\theta_j := \theta_j - \alpha \, \frac{\partial J(\theta)}{\partial \theta_j}
\qquad \text{(simultaneously for every } j = 0, 1, \ldots, n\text{)}
$$

In multivariate linear regression, the goal of the optimization is to minimize the sum of squared errors:

$$
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
$$

The partial derivative of this cost function follows from the chain rule: the power rule brings the exponent 2 down as a coefficient (reducing the power from 2 to 1), which cancels the 1/2 factor, and then we multiply by the derivative of h_theta(x) with respect to theta_j, which is x_j.
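
Written out with the notation above, that chain-rule step is:

$$
\frac{\partial J(\theta)}{\partial \theta_j}
= \frac{1}{2m} \sum_{i=1}^{m} 2 \left( h_\theta(x^{(i)}) - y^{(i)} \right) \frac{\partial h_\theta(x^{(i)})}{\partial \theta_j}
= \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
$$

Substituting this back into the per-component update gives the rule quoted in the question: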

$$
\theta_j := \theta_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
$$

Here, x_j^(i) is what the partial derivative of h_theta(x^(i)) with respect to theta_j evaluates to; it is simply the j-th feature of the i-th training example.
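
Putting the whole update together, here is a minimal, self-contained NumPy sketch of batch gradient descent for multivariate linear regression (function and variable names such as `gradient_descent`, `X`, `y`, `theta` are illustrative, not from the original post):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for multivariate linear regression.

    X : (m, n) design matrix (include a column of ones for the intercept).
    y : (m,) vector of targets.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        errors = X @ theta - y            # h_theta(x^(i)) - y^(i) for every example i
        gradient = (X.T @ errors) / m     # (1/m) * sum_i error_i * x_j^(i), for every j at once
        theta = theta - alpha * gradient  # simultaneous update of all theta_j
    return theta

# Example usage on synthetic data: y = 4 + 3*x1 - 2*x2 + noise
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 2))
y = 4.0 + 3.0 * X_raw[:, 0] - 2.0 * X_raw[:, 1] + 0.1 * rng.normal(size=100)
X = np.hstack([np.ones((100, 1)), X_raw])                  # prepend the intercept column
print(gradient_descent(X, y, alpha=0.1, num_iters=2000))   # approx [4, 3, -2]
```

The vectorized line `X.T @ errors` computes, for every j at once, the same sum of error-times-feature terms that the question spells out element by element.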