5
votes

I am trying to fit a linear model, and my dataset is normalized so that each feature is divided by its maximum possible value, so the values range from 0 to 1. From my previous post, Linear Regression vs Closed form Ordinary least squares in Python, I learned that linear regression in scikit-learn produces the same result as closed-form OLS when the fit_intercept parameter is set to False. I am not quite getting how fit_intercept works.

For any linear problem, if y is the predicted value:

y(w, x) = w_0 + w_1 x_1 + ... + w_p x_p

Across the module, the vector w = (w_1, ..., w_p) is denoted as coef_ and w_0 as intercept_.

In closed-form OLS we also have a bias value w_0; we introduce a vector X_0 = [1 ... 1] before computing the dot product and solve using matrix multiplication and the pseudo-inverse.

w = np.dot(X.T, X)                              # X^T X
w1 = np.dot(np.linalg.pinv(w), np.dot(X.T, Y))  # (X^T X)^+ X^T Y
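For example, building the design matrix with the bias column before solving would look roughly like this (a sketch on toy data, not my actual dataset):

import numpy as np

# toy data just to show the shapes (my real features are normalized to 0-1)
X = np.random.rand(10, 3)
Y = np.random.rand(10)

# prepend the bias column X_0 = [1, ..., 1]
X_b = np.column_stack([np.ones(X.shape[0]), X])

w = np.dot(X_b.T, X_b)
w1 = np.dot(np.linalg.pinv(w), np.dot(X_b.T, Y))
# w1[0] is the bias/intercept, w1[1:] are the feature weights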

When fit_intercept is True, my understanding is that scikit-learn linear regression solves, with y again the predicted value,

y(w, x) = w_0 + w_1 x_1 + ... + w_p x_p + b

where b is the intercept term.

How does using fit_intercept change the model, and when should one set it to True/False? I was looking at the source code, and it seems the coefficients are normalized by some scale:

if self.fit_intercept:
    self.coef_ = self.coef_ / X_scale
    self.intercept_ = y_offset - np.dot(X_offset, self.coef_.T)
else:
    self.intercept_ = 0

What does this scaling do exactly? I want to interpret the coefficients in both approaches (LinearRegression and closed-form OLS), but since just setting fit_intercept to True/False gives different results for LinearRegression, I can't quite decide on the intuition behind them. Which one is better, and why?

There is no intercept term in the linked answer for OLS. You did present some pseudo-code (or at least it looks like that). Implement it correctly and you will obtain equal results (if you don't have differences with regard to normalization). – sascha
I obtained closer results using fit_intercept=False, but my question here is somewhat theoretical. Say I want to extract important features based on the coefficients found from the above steps. Just setting fit_intercept to True/False gives completely different results, so which one is better to consider? In all the machine learning books, linear regression is solved without the intercept parameter, but scikit-learn introduced it. – Farzana Yusuf
cs229.stanford.edu/notes/cs229-notes1.pdf – I have followed Andrew Ng's ML course too, so fit_intercept is something I couldn't relate to what I knew. Is there any paper or reference where I can look for an explanation of fit_intercept? – Farzana Yusuf
There are so many resources; I can't imagine you did not stumble on anything useful. Here, for example. Just use it if you don't have a good reason not to. – sascha

2 Answers

2
votes

Let's take a step back and consider the following sentence you said:

since just setting fit_intercept True/False gives different result for Linear Regression

That is not entirely true. It may or may not be different, and it depends entirely on your data. It would help to understand what goes into the calculation of regression weights. I mean this somewhat literally: what does your input (x) data look like?

Understanding your input data, and understanding why it matters, will help you realize why you sometimes get different results, and why at other times the results are the same.

Data setup

Let's set up some test data:

import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(1243)

x = np.random.randint(0,100,size=10)
y = np.random.randint(0,100,size=10)

Our x and y variables look like this:

   X   Y
  51  29
   3  73
   7  77
  98  29
  29  80
  90  37
  49   9
  42  53
   8  17
  65  35

No-intercept model

Recall that the calculation of regression weights has a closed form solution, which we can obtain using normal equations:

w = (X^T X)^(-1) X^T y

Using this method, we get a single regression coefficient because we only have 1 predictor variable:

x = x.reshape(-1,1)                             # design matrix: a single column of x values
w = np.dot(x.T, x)                              # X^T X
w1 = np.dot(np.linalg.pinv(w), np.dot(x.T, y))  # (X^T X)^+ X^T y

print(w1)
[ 0.53297593]

Now, let's look at scikit-learn when we set fit_intercept = False:

clf = LinearRegression(fit_intercept=False)

print(clf.fit(x, y).coef_)
[ 0.53297593]

What happens when we set fit_intercept = True instead?

clf = LinearRegression(fit_intercept=True)

print(clf.fit(x, y).coef_)
[-0.35535884]

It would seem that setting fit_intercept to True and False gives different answers, and that the "correct" answer occurs only when we set it to False, but this is not entirely correct...

Intercept model

At this point we have to consider what our input data actually is. In the models above, our data matrix (also called a feature matrix, or design matrix in statistics) is just a single vector containing our x values. The y variable is not included in the design matrix. If we want to add an intercept to our model, one common approach is to add a column of 1's to the design matrix, so x becomes:

# build a two-column design matrix: a column of 1's plus the original x values
x_vals = x.flatten()
x = np.zeros((10, 2))
x[:,0] = 1
x[:,1] = x_vals

   intercept     x
0        1.0  51.0
1        1.0   3.0
2        1.0   7.0
3        1.0  98.0
4        1.0  29.0
5        1.0  90.0
6        1.0  49.0
7        1.0  42.0
8        1.0   8.0
9        1.0  65.0

Now, when we use this as our design matrix, we can try the closed form solution again:

w = np.dot(x.T, x)
w1 = np.dot(np.linalg.pinv(w), np.dot(x.T, y))

print(w1)
[ 59.60686058  -0.35535884]

Notice 2 things:

  1. We now have 2 coefficients. The first is our intercept and the second is the regression coefficient for the x predictor variable
  2. The coefficient for x matches the coefficient from the scikit-learn output above when we set fit_intercept = True

So in the scikit-learn models above, why was there a difference between True and False? Because in one case no intercept was modeled, while in the other case the underlying model included an intercept. This is confirmed when you manually add an intercept term/column when solving the normal equations.
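You can also check the intercept itself. Fitting scikit-learn with fit_intercept=True on the original single-column x should reproduce the intercept we just computed from the ones-column normal equations (a quick sketch reusing the x and y defined above; note that x currently holds the two-column design matrix):

x_single = x[:, 1].reshape(-1, 1)   # recover the original single predictor column

clf = LinearRegression(fit_intercept=True)
clf.fit(x_single, y)

print(clf.intercept_, clf.coef_)
# expect roughly: 59.60686058 [-0.35535884]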

If you were to use this new design matrix in scikit-learn, it doesn't matter whether you set True or False for fit_intercept, the coefficient for the predictor variable will not change (the intercept value will be different due to centering, but that's irrelevant for this discussion):

clf = LinearRegression(fit_intercept=False)
print(clf.fit(x, y).coef_)
[ 59.60686058  -0.35535884]

clf = LinearRegression(fit_intercept=True)
print(clf.fit(x, y).coef_)
[ 0.         -0.35535884]

Summing up

The output (i.e. coefficient values) you get will be entirely dependent on the matrix that you input into these calculations (whether it's normal equations, scikit-learn, or anything else).

How does it differ to use fit_intercept in a model and when should one set it to True/False

If your design matrix does not contain a 1's column, then normal equations and scikit-learn (fit_intercept = False) will give you the same answer (as you noted). However, if you set the parameter to True, the answer you get will actually be the same as normal equations if you calculated that with a 1's column.

When should you set True/False? As the name suggests, you set False when you don't want to include an intercept in your model. You set True when you do want an intercept, with the understanding that the coefficient values will change but will match the normal-equations approach when your data includes a 1's column.

So True/False doesn't actually give you different results (compared to normal equations) when considering the same underlying model. The difference you observe is because you're looking at two different statistical models (one with an intercept term, and one without). The reason the fit_intercept parameter exists is so you can create an intercept model without the hassle of manually adding that 1's column. It effectively allows you to toggle between the two underlying statistical models.
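As a compact recap (a sketch using the x_vals and y from above), the two routes below describe the same intercept model; fit_intercept=True just saves you the manual column of 1's:

# Route 1: let scikit-learn model the intercept
clf = LinearRegression(fit_intercept=True).fit(x_vals.reshape(-1, 1), y)

# Route 2: add the 1's column yourself and solve the normal equations
X_b = np.column_stack([np.ones(len(x_vals)), x_vals])
w = np.dot(np.linalg.pinv(np.dot(X_b.T, X_b)), np.dot(X_b.T, y))

print(clf.intercept_, clf.coef_[0])   # intercept and slope from scikit-learn
print(w[0], w[1])                     # the same two values from the normal equations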

0
votes

Without going into the details of the mathematical formulation: when fit_intercept is set to False, the estimator deliberately sets the intercept to zero, and this in turn affects the other regressors, as the 'responsibility' for reducing the error falls onto those factors. Consequently, the results can be very different in the two cases if the problem is sensitive to the presence of an intercept term. The scaling shifts the origin, thereby allowing the same closed-form solution to be used for both intercept and intercept-free models.
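To make the shift of origin concrete, here is a rough sketch of the idea (not scikit-learn's literal implementation): center X and y, solve the intercept-free problem on the centered data, and recover the intercept afterwards, mirroring the intercept_ = y_offset - np.dot(X_offset, self.coef_.T) line quoted in the question.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.rand(20, 3)   # toy feature matrix
y = rng.rand(20)      # toy targets

# shift the origin: center X and y at their means
X_offset = X.mean(axis=0)
y_offset = y.mean()
Xc, yc = X - X_offset, y - y_offset

# solve the intercept-free problem on the centered data
coef = np.dot(np.linalg.pinv(np.dot(Xc.T, Xc)), np.dot(Xc.T, yc))

# recover the intercept from the offsets
intercept = y_offset - np.dot(X_offset, coef)

# agrees (up to floating point) with scikit-learn's fit_intercept=True
clf = LinearRegression(fit_intercept=True).fit(X, y)
print(np.allclose(coef, clf.coef_), np.isclose(intercept, clf.intercept_))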