Multiple linear regression with numpy

Question

I want to calculate multiple linear regression with numpy. I need to regress my dependent variable (y) against several independent variables (x1, x2, x3, etc.).

For example, with this data:

print 'y        x1      x2       x3       x4      x5     x6       x7'
for t in texts:
    print "{:>7.1f}{:>10.2f}{:>9.2f}{:>9.2f}{:>10.2f}{:>7.2f}{:>7.2f}{:>9.2f}" /
   .format(t.y,t.x1,t.x2,t.x3,t.x4,t.x5,t.x6,t.x7)

(output for above:)

y      x1    x2    x3    x4 x5   x6  x7
20.64, 0.0,  296,  54.7, 0, 519, 2,  24.0 
25.12, 0.0,  387,  54.7, 1, 678, 2,  24.0 
19.22, 0.0,  535,  54.7, 0, 296, 2,  24.0 
18.99, 0.0,  519,  18.97, 0, 296, 2,   54.9 
18.89, 0.0,  296,  18.97, 0, 535, 2,   54.9 
25.51, 0.0,  678,  18.97, 1, 387, 2,   54.9 
20.19, 0.0,  296,  25.51,  0,  519,  2,   54.9 
20.75, 0.0,  535,  25.51,  0,  296,  2,   54.9 
24.13, 0.0,  387,  25.51,  1,  678,  2,   54.9 
19.24, 0.0,  519,  0,  0,  296,  2,   55.0 
20.90, 0.0,  296,  0,  0,  535,  2,   55.0 
25.30, 0.0,  678,  0,  1,  387,  2,   55.0 
20.78, 0.0,  296,  0,  0,  519,  2,   55.2 
23.01, 0.0,  535,  0,  0,  296,  2,   55.2 
25.20, 0.0,  387,  0,  1,  678,  2,   55.2 
19.12, 0.0,  519,  0,  0,  296,  2,   55.3 
20.03, 0.0,  296,  0,  0,  535,  2,   55.3 
25.22, 0.0,  678,  0,  1,  387,  2,   55.3

I have created this function that I think it gives the coefficients A from Y = a1x1 + a2x2 + a3x3 + a4x4 + a5x5 + a6x6 + +a7x7 + c.

def calculate_linear_regression_numpy(xx, yy):
    """ calculate multiple linear regression """
    import numpy as np
    from numpy import linalg

    A = np.column_stack((xx, np.ones(len(xx))))
    coeffs = linalg.lstsq(A, yy)[0]  # obtaining the parameters

    return coeffs

xx is a list that contains each row of x's, and yy is a list that contains all y.

The A is this:

00 = {ndarray} [   0.   296.   519.    2.    0.   24.    54.7    1. ]
01 = {ndarray} [   0.   296.   535.    2.    0.   24.    54.7    1. ]
02 = {ndarray} [   0.   387.   678.    2.    1.   24.    54.7    1. ]
03 = {ndarray} [   0.   296.   519.    2.    0.   54.9   18.97957206    1. ]
04 = {ndarray} [   0.   296.   535.    2.    0.   54.9   18.97957206    1. ]
05 = {ndarray} [   0.   387.   678.    2.    1.   54.9   18.97957206    1. ]
06 = {ndarray} [   0.   296.   519.    2.    0.   54.9   25.518085    1.   ]
07 = {ndarray} [   0.   296.   535.    2.    0.   54.9   25.518085    1.   ]
08 = {ndarray} [   0.   387.   678.    2.    1.   54.9   25.518085    1.   ]
09 = {ndarray} [   0.   296.   519.    2.    0.   55.    0.    1.]
10 = {ndarray} [   0.   296.   535.    2.    0.   55.    0.    1.]
11 = {ndarray} [   0.   387.   678.    2.    1.   55.    0.    1.]
12 = {ndarray} [   0.   296.   519.    2.    0.   55.2   0.    1. ]
13 = {ndarray} [   0.   296.   535.    2.    0.   55.2   0.    1. ]
14 = {ndarray} [   0.   387.   678.    2.    1.   55.2   0.    1. ]
15 = {ndarray} [   0.   296.   519.    2.    0.   55.3   0.    1. ]
16 = {ndarray} [   0.   296.   535.    2.    0.   55.3   0.    1. ]
17 = {ndarray} [   0.   387.   678.    2.    1.   55.3   0.    1. ]

And the np.dot(A,coeffs) is this:

[ 19.69873196  20.33871176  24.95249051  19.59198545
20.23196525  24.845744    19.41602911  20.05600891  24.66978766
20.09928377  20.73926357  25.35304232  20.09237109  20.73235089
25.34612964  20.08891474  20.72889454  25.34267329]

At the return of the function, the coeffs, contains this 8 values.

[0.0, -0.0010535377771944548, 0.039998737474281849, 0.62111016637058492, -1.0101687709958682, -0.034563440146209781, -0.026910757873959575, 0.31055508318529385]

I don't know if the coeffs[0] or the coeffs[7] is the c from the equation Y defined above.

I take this coeffs and I calculate the new Ŷ multiplying the coeffs with the new ẍ's, like this:

Ŷ=a1ẍ1 + a2ẍ2 + a3ẍ3 + a4ẍ4 + a5ẍ5 + a6ẍ6 + +a7ẍ7 + c

Am I calculating Ŷ correctly? And what should I do when I get a Ŷ with a negative number? Which term is the c (a[0] or a[7])?

The c term would be a[7] since you are putting the ones column at the end, but your coefficients doesn't make sense, you can check by doing print np.dot(A,coeffs), it should give you yy, or very similar. When I tried I got the coefficients [ -0.49104607 0.83271938 0.0860167 0.1326091 6.85681762 22.98163883 -41.08437805 -19.08085066] — Noel Segura Meraz
Look at the x2 and x3 values of row 00, they are 1.10224946e+09 and 4.40557880e+07, which don't appear anywhere on the first group of data you presented. Also there are 18 rows in the first data and 19 in A — Noel Segura Meraz
If you input the x's values in the right order, then it just means that that is the value your regression is calculating. What a negative Y means depends on what are you calculating. But at equation level, a negative answer is completely valid — Noel Segura Meraz

Chris Chris · Accepted Answer · 2016-01-14T10:10:16

The columns keep the order you specify them in, otherwise you would be unable to use the coefficients!

Remember, from the matrix form of the least squares problem, your estimate of Y is given by A dot C where C is your coefficient vector/matrix.

So, print out A, and it should be in the form of X1....X7 [Column of Ones].

whichever column number contains your ones, is the equivalent entry in the coefficient vector for your offset coefficient.

Just by the size of the parameters coeff[7] looks to be the offset, as it is orders of magnitude larger, which doesn't look logical as a multiplicative coefficient given the X and Y values you supplied.

Multiple linear regression with numpy

1 Answers