
I'm fitting a quartic with statsmodels' OLS linear regression using the Patsy formula y ~ x + I(x**2) + I(x**3) + I(x**4), but the resulting regression fits the data poorly compared to LibreOffice Calc. Why doesn't statsmodels match what LibreOffice Calc produces?

statsmodels code:

import io
import numpy
import pandas
import matplotlib.pyplot
import statsmodels.formula.api

csv_data = """Year,CrudeRate
1999,197.0
2000,196.5
2001,194.3
2002,193.7
2003,192.0
2004,189.2
2005,189.3
2006,187.6
2007,186.9
2008,186.0
2009,185.0
2010,186.2
2011,185.1
2012,185.6
2013,185.0
2014,185.6
2015,185.4
2016,185.1
2017,183.9
"""

df = pandas.read_csv(io.StringIO(csv_data))

cause = "Malignant neoplasms"
x = df["Year"].values
y = df["CrudeRate"].values

# Quartic fit via the Patsy formula interface
olsdata = {"x": x, "y": y}
formula = "y ~ x + I(x**2) + I(x**3) + I(x**4)"
model = statsmodels.formula.api.ols(formula, olsdata).fit()

print(model.params)

df.plot("Year", "CrudeRate", kind="scatter", grid=True, title="Deaths from {}".format(cause))

# numpy.poly1d expects coefficients from the highest degree down, hence the reversal
func = numpy.poly1d(model.params.values[::-1])
matplotlib.pyplot.plot(df["Year"], func(df["Year"]))

matplotlib.pyplot.show()

This produces the following coefficients:

Intercept    9.091650e-08
x            9.127904e-05
I(x ** 2)    6.109623e-02
I(x ** 3)   -6.059164e-05
I(x ** 4)    1.503399e-08

And the following graph:

[Figure 1: statsmodels quartic fit over the scatter plot]

However, if I bring the data into LibreOffice Calc, click on the plot, choose "Insert Trend Line...", select "Polynomial", enter "Degrees" = 4, and select "Show Equation", the resulting trend line differs from the statsmodels one and appears to be a closer fit:

[Figure 2: LibreOffice Calc degree-4 trend line]

The coefficients are:

Intercept =  1.35e10
x         =  2.69e7
x^2       = -2.01e4
x^3       =  6.69
x^4       = -0.83e-3

statsmodels version:

$ pip3 list | grep statsmodels
statsmodels                  0.9.0

Edit: Cubic also doesn't match, but quadratic does.

Edit: Shifting Year down by 1998 (and doing the same in LibreOffice) makes the results match:

df = pandas.read_csv(io.StringIO(csv_data))
df["Year"] = df["Year"] - 1998

Coefficients and plot after the shift:

Intercept    197.762384
x             -0.311548
I(x ** 2)     -0.315944
I(x ** 3)      0.031304
I(x ** 4)     -0.000833

[Figure 3: fit after shifting Year by 1998]

Comments:

My guess is that the X matrix in the regression is badly conditioned because of the large values of the years. Try year - 1998 as the trend variable. – Josef
And maybe also scale it down; x**4 will be very large relative to the 1 for the constant. – Josef
@JamesPhillips A third-order (cubic) fit is also quite different from LibreOffice's. If I go down to second order (quadratic), then things match. – freeradical
statsmodels doesn't do any automatic rescaling. Polynomials don't work well for large numbers and should always be scaled to a "reasonable" range. For example, numpy.polynomial has the option to scale to the interval [-1, 1], on which all polynomials are well behaved. – Josef
@Josef To clarify my last comment: if I perform the same shift in LibreOffice by subtracting 1998, then all of the coefficients match and my problem is solved. So I'm just left with the question of whether I should always scale down to single digits. – freeradical

1 Answer


Based on the comments from @Josef, the problem is that high-order polynomials become numerically ill-conditioned when x takes large values, and statsmodels doesn't auto-scale the domain. In addition, I also needed to predict an out-of-sample value five years past the data (I didn't mention this in the original question because I didn't expect the domain would need to be transformed), so I make that prediction point the end of the scaled range:

predict_x = +5  # predict 5 years beyond the last data point
min_scaled_domain = -1
max_scaled_domain = +1
# Map Year from [min, max + predict_x] onto [-1, +1]; refit the model afterwards
df["Year"] = df["Year"].transform(
    lambda x: numpy.interp(x, (x.min(), x.max() + predict_x),
                           (min_scaled_domain, max_scaled_domain)))

This transformation creates a well-fitted regression:

[Figure 4: fit over the rescaled domain]

If the same domain transformation is applied in LibreOffice Calc, then the coefficients match.
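The conditioning problem can also be checked directly: the quartic design matrix is near-singular when built from raw years but well-behaved after rescaling. A minimal sketch with numpy, using the same [-1, 1] mapping as above:

# Raw years vs. the rescaled domain (data range plus 5 prediction years)
years = numpy.arange(1999, 2018)
scaled = numpy.interp(years, (1999, 2022), (-1, 1))

# numpy.vander builds the quartic design matrix (columns x**4 ... x**0)
print(numpy.linalg.cond(numpy.vander(years, 5)))   # astronomically large
print(numpy.linalg.cond(numpy.vander(scaled, 5)))  # small and well-behaved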

Finally, to print the predicted value:

# numpy.polynomial.Polynomial takes coefficients from lowest degree up,
# matching the order of model.params (Intercept, x, x**2, ...)
func = numpy.polynomial.Polynomial(model.params)
print(func(max_scaled_domain))
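To predict for a year other than the last one in the scaled range, the year has to go through the same mapping first. A small sketch with a hypothetical scale_year helper (1999 and 2017 are the data's min and max years):

def scale_year(year):
    # Same interp as the transform above: the data range extended by
    # predict_x years maps onto [min_scaled_domain, max_scaled_domain]
    return numpy.interp(year, (1999, 2017 + predict_x),
                        (min_scaled_domain, max_scaled_domain))

print(func(scale_year(2022)))  # identical to func(max_scaled_domain)
print(func(scale_year(2018)))  # one year past the observed data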