0
votes

I am performing multiple linear regression in matrix form in MATLAB and have come across the following warning:

Warning: Matrix is close to singular or badly scaled. Results may be inaccurate. RCOND = smth.

I suspect it is caused by the way I am performing the linear regression. I am following the standard method, where the vector of coefficients is ((X'X)^(-1))*(X'Y). My matrix X has the following format: the first column is all ones so that the intercept can be found, and the other columns are powers of the x-coordinates (a polynomial basis model), so x, then x^2, x^3, etc. (column vectors). I think the warning arises because with higher powers the values become extremely small, and MATLAB somehow turns them into NaN.

I was thinking of using another variable type, but isn't double as big as it gets? Is there a way to force MATLAB not to turn those extremely small values into NaN, if that is in fact what MATLAB does?

2 Answers

1
votes

Several comments. (Check definitions at end if confused by notation.)

What's going wrong?

  1. One of the assumptions behind linear regression is that E[x_i * x_i'] is full rank. When you approximate the population mean E[x_i * x_i'] with the sample mean X'*X / n, you want X'*X to be full rank! The warning is telling you that, up to machine precision, this assumption is violated!

    I'm guessing one of your columns is always close to 0, or, when you raise x to high powers, two columns become numerically similar (e.g. 0 or HUGE in the same rows). Imagine the linear equation:

                              y = b1 * x1 + b2 * x2 + e
    

    If x2 is always zero, you're never going to estimate b2 properly: it could be 10, it could be 10^10. Basically the same holds if x2 is extremely close to zero all the time, or if some linear combination of columns is numerically close to another column: then tiny, tiny changes in the data will lead to HUGE swings in the estimate. In mathematical terms, what's happening is that E[x_i * x_i'] is effectively less than full rank.

    Something to try: check cond(X'*X) when raising x to powers 1 through 3, then 1 through 4, 1 through 5, etc. At some point, your condition number goes through the roof as X'*X becomes numerically close to rank deficient.

  2. WAY before you get this warning, your estimates are already TOTAL CRAP. The warning ("Matrix is close to singular or badly scaled...") tells you that X'*X has such a high condition number that machine precision (about 1e-16 for doubles) combined with this ill-conditioned matrix makes your estimates unreliable. But the error in your data is almost certainly WAY WAY BIGGER THAN 1e-16. For estimation purposes, your data is effectively multicollinear WAY WAY earlier.
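
The condition-number check suggested in point 1 can be sketched as follows. This is only an illustration: the data vector x, the variable names and the maximum degree are assumptions, not taken from the question, so substitute your own x.

```matlab
% Hypothetical data: 50 points on [0, 10]; replace with your own x.
x = linspace(0, 10, 50)';
maxDegree = 6;                      % assumed upper limit for the sweep
condXtX = zeros(maxDegree, 1);

for d = 1:maxDegree
    % Design matrix with columns x.^0, x.^1, ..., x.^d
    % (the first column is the column of ones for the intercept).
    X = bsxfun(@power, x, 0:d);
    condXtX(d) = cond(X' * X);
    fprintf('degree %d: cond(X''*X) = %.3g\n', d, condXtX(d));
end
```

If the printed values climb by several orders of magnitude with each added power, that is the ill-conditioning described above, and it shows up long before MATLAB issues the warning.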

What should you be doing?

  1. You can't estimate coefficients on such high powers of x. Your data isn't good enough to do that. You need to dial this WAY WAY back until the condition number of X'*X is reasonable.

    Perhaps you can only estimate coefficients on up to a 2nd order polynomial! Don't get greedy and try to estimate what's just not possible.

  2. Compute your estimate with b = X \ y.

    For any linear equation, solving Ax = c with x = inv(A)*c is NOT optimal. Forming the inverse is unnecessary: you can solve the linear system directly with A\c. In this problem, you can solve for your coefficients with b = (X'*X) \ (X'*y). And because of the way the \ operator works (it solves an overdetermined system in the least squares sense), the simplest code is:

    b = X \ y;   % USE THIS! (treat it as a magical incantation that solves b = inv(X'*X) * X'*y)
    

    This last point isn't the source of your problem, but you should fix it anyway.
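
To make the comparison concrete, here is a small sketch of the three ways of computing b mentioned above. The data (a slightly perturbed cubic) and all variable names are made up for illustration. The last line also prints both condition numbers: in the 2-norm, cond(X'*X) equals cond(X)^2, which is one reason to prefer working with X directly.

```matlab
% Hypothetical data: a cubic with a small deterministic perturbation.
n = 40;
x = linspace(-1, 1, n)';
y = 1 + 2*x - 3*x.^2 + 0.5*x.^3 + 0.01*sin(37*(1:n))';

% Design matrix: columns are x.^0 (ones), x, x.^2, x.^3.
X = bsxfun(@power, x, 0:3);

b_inv    = inv(X'*X) * (X'*y);   % textbook formula, forms an explicit inverse
b_normal = (X'*X) \ (X'*y);      % solves the normal equations without inverting
b_direct = X \ y;                % least-squares solve applied directly to X

disp([b_inv, b_normal, b_direct]);
fprintf('cond(X) = %.3g, cond(X''*X) = %.3g\n', cond(X), cond(X'*X));
```

On a well-conditioned problem like this one the three columns should agree closely; the differences only become visible as the polynomial degree (and with it the condition number) grows.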

Definitions:

  1. For each observation i, x_i is a k by 1 vector.
  2. n is the number of observations.
  3. Data matrix X is an n by k matrix composed of [x_1'; x_2'; x_3'; ...; x_n'];
  4. y is a n by 1 vector.
  5. We are trying to estimate k by 1 vector b in the linear equation y_i = x_i' * b + e_i.
0
votes

MATLAB doesn't introduce NaNs just because numbers are small (or even very small). If your input x vector does not have NaNs in it, then MATLAB is not putting them into x^2 or x^3 or x^n.
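
A quick way to convince yourself of this, using made-up values: a power that is too small to represent as a double underflows to 0, not to NaN. NaN comes from undefined operations such as Inf - Inf or 0/0.

```matlab
% Hypothetical tiny inputs; the last one cubed is too small for a double.
x = [1e-3; 1e-100; 1e-200];
p = x .^ 3;

disp(p);                 % the last entry underflows to 0
disp(any(isnan(p)));     % no NaN was introduced by taking powers

% NaN only appears from undefined operations, for example:
disp(isnan(Inf - Inf));
disp(isnan(0 / 0));
```

So if NaNs do show up in your X, check the input data with any(isnan(x)) rather than the powers themselves.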

However, if one or more columns of your X matrix are close to zero, then you have an ill-conditioned matrix that isn't really suitable for this regression. You'd need to rethink the model (i.e. the degree of the polynomial) you are trying to use.

BTW, for this specific problem, unless you are required to write your own function, then you can just use polyfit, or one of the many regression functions in the Statistics Toolbox.
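
As a rough sketch of the polyfit route (the data and the degree are invented for illustration): requesting the third output mu makes polyfit center and scale x before fitting, which often helps when x is badly scaled. The coefficients in p then refer to the scaled variable (x - mu(1)) / mu(2), so pass mu to polyval when evaluating.

```matlab
% Hypothetical, badly scaled x values and an exact quadratic in (x - 1005).
x = linspace(1000, 1010, 30)';
y = 2 + 0.5*(x - 1005) - 0.1*(x - 1005).^2;

degree = 2;                              % assumed model order
[p, S, mu] = polyfit(x, y, degree);      % mu = [mean(x); std(x)]

% Evaluate the fit; mu tells polyval to apply the same centering and scaling.
yhat = polyval(p, x, [], mu);
fprintf('max abs residual: %.3g\n', max(abs(y - yhat)));
```

Centering and scaling does not rescue a model with too many powers, but it does remove the part of the ill-conditioning that comes purely from the units of x.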

If you are required to write your own function then ensure that you are using the backslash operator not the inv function, i.e. use (X'*X)\(X'*Y) not inv(X'*X)*(X'*Y).