Fit one-dimensional data with scikit-learn to predict line

Question

I wrote code with scikit-learn to build an SVR prediction model for one-dimensional toy data and then plot the result with matplotlib.

The blue line is the true data. The model with the linear kernel fits a nice line, but for the kernel of degree 2, the predictions are not what I would expect. I would like to have a model that would predict the values of the blue line slightly below what the orange line is predicting. I painted a black line to visualize what I had in mind.

Why is this happening? The data seems a good candidate for a polynomial of degree 2. The black trend line following the true data and then decreasing much later on the right should result in a much better fit than what the green line is providing, if I just look at this plot. Shouldn't such a model be found with a polynomial of degree 2 based on the data? It would also curve nicely at X = 0 close to the blue line, instead of having this curvature with a higher estimated y value there.
How can I achieve a model that I want?

The code below is complete and self-contained. Run it to get the plot above (minus the painted black line).

# some data
y = [0, 3642, 6414, 9844, 13210, 16072, 18868, 22275, 25551, 28949, 31680, 34412, 37290, 39858, 42557, 
    45094, 47354, 49547, 51874, 54534, 55987, 55987, 58377, 60767, 63109, 65060, 66865, 68540, 70328, 
    72035, 73905, 75791, 77873, 79791, 81775, 83726]
X = range(0, len(y))
X_longer = range(0, len(y)*2)

# train models
from sklearn.svm import SVR
import numpy as np
clf_1 = SVR(kernel='poly', C=1e3, degree=1)
clf_2 = SVR(kernel='poly', C=1e3, degree=2)

clf_1.fit(np.array(X).reshape(-1, 1), y)
clf_2.fit(np.array(X).reshape(-1, 1), y)

%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

# plot real data
plt.plot(X, y, linewidth=8.0, label='true data')

# predict data points based on models (one vectorized call per model)
X_longer_2d = np.array(X_longer).reshape(-1, 1)
predicted_1_y = clf_1.predict(X_longer_2d)
predicted_2_y = clf_2.predict(X_longer_2d)

# plot model predictions
plt.plot(X_longer, predicted_1_y, linewidth=6.0, ls=":", label='model, degree 1')
plt.plot(X_longer, predicted_2_y, linewidth=6.0, ls=":", label='model, degree 2')

plt.legend(loc='upper left')
plt.show()

Inverse Inverse · Accepted Answer · 2017-03-21T19:32:42

This happens because linear and quadratic features cannot level off: as X grows, a model built from them must eventually keep increasing or decreasing without bound. To capture the flattening trend you want, you need a feature such as a square root or a logarithm.
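As a rough illustration of that point, the sketch below fits an ordinary least-squares quadratic to the question's data with np.polyfit. This is a stand-in chosen for simplicity, not the SVR model from the question, and the names a, b, c, vertex and far_x are only illustrative. It shows that a parabola fitted to this concave data has a turning point, after which its predictions keep falling.

```python
import numpy as np

# Data from the question.
y = np.array([0, 3642, 6414, 9844, 13210, 16072, 18868, 22275, 25551, 28949,
              31680, 34412, 37290, 39858, 42557, 45094, 47354, 49547, 51874,
              54534, 55987, 55987, 58377, 60767, 63109, 65060, 66865, 68540,
              70328, 72035, 73905, 75791, 77873, 79791, 81775, 83726],
             dtype=float)
x = np.arange(len(y), dtype=float)

# Least-squares quadratic a*x**2 + b*x + c (a stand-in, not the SVR).
a, b, c = np.polyfit(x, y, deg=2)

# A parabola has exactly one turning point, at x = -b / (2a).
vertex = -b / (2 * a)

# Evaluate at the turning point and at points further away from it.
far_x = np.array([vertex, 2 * vertex, 4 * vertex])
far_pred = np.polyval([a, b, c], far_x)

print("leading coefficient:", a)
print("turning point at x =", vertex)
print("predictions at and beyond the turning point:", far_pred)
```

Because the leading coefficient is negative here, the curve cannot flatten out: past the turning point it only goes down, and it does so faster and faster.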

A simple way to do this is to transform the input signal before fitting. For example, assume a square-root trend:

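# square-root feature; apply the same transform to any inputs passed to predict()
X = np.sqrt(np.array(X))[:, None]
# C=1e3 as in the question; the default C=1.0 regularizes too strongly for targets this large
clf = SVR(kernel='linear', C=1e3).fit(X, y)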
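The transform-then-fit idea can be sketched end to end as below. This is a minimal sketch under some assumptions of mine: it wraps the square root in a scikit-learn pipeline (FunctionTransformer plus SVR) so that the same transform is applied at prediction time, and it reuses C=1e3 from the question. Whether a square root is the right shape for this data is itself an assumption.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import SVR

# Data from the question.
y = [0, 3642, 6414, 9844, 13210, 16072, 18868, 22275, 25551, 28949,
     31680, 34412, 37290, 39858, 42557, 45094, 47354, 49547, 51874,
     54534, 55987, 55987, 58377, 60767, 63109, 65060, 66865, 68540,
     70328, 72035, 73905, 75791, 77873, 79791, 81775, 83726]
X = np.arange(len(y)).reshape(-1, 1)
X_longer = np.arange(2 * len(y)).reshape(-1, 1)

# The pipeline applies sqrt to the inputs in both fit() and predict(),
# so the transform cannot be forgotten when extrapolating.
model = make_pipeline(
    FunctionTransformer(np.sqrt),
    SVR(kernel='linear', C=1e3),  # C taken from the question
)
model.fit(X, y)

# Predict over twice the training range in one call.
pred = model.predict(X_longer)
print(pred[:3], pred[-3:])
```

The resulting curve keeps rising over the extended range, but each step adds less than the one before, which is the flattening behaviour a linear or quadratic feature cannot provide.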

For more general use cases, where you don't know the trend in advance or don't want to assume a particular transformation like this, you might try a symbolic regression tool such as Eureqa to search for a suitable transformation and mathematical model.
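A much simpler stand-in for that kind of search is to try a handful of candidate transformations and compare how well a straight line fits each transformed input. The sketch below does this with plain NumPy. It is not how Eureqa works, and the candidate list and the helper name rmse_of_linear_fit are hypothetical choices made for illustration.

```python
import numpy as np

# Data from the question.
y = np.array([0, 3642, 6414, 9844, 13210, 16072, 18868, 22275, 25551, 28949,
              31680, 34412, 37290, 39858, 42557, 45094, 47354, 49547, 51874,
              54534, 55987, 55987, 58377, 60767, 63109, 65060, 66865, 68540,
              70328, 72035, 73905, 75791, 77873, 79791, 81775, 83726],
             dtype=float)
x = np.arange(len(y), dtype=float)

# Hypothetical candidate transformations of the input.
candidates = {
    "identity": lambda v: v,
    "sqrt": np.sqrt,
    "log1p": np.log1p,  # log(1 + x), defined at x = 0
}

def rmse_of_linear_fit(feature, target):
    """Fit target = slope * feature + intercept and return the in-sample RMSE."""
    slope, intercept = np.polyfit(feature, target, deg=1)
    residuals = target - (slope * feature + intercept)
    return float(np.sqrt(np.mean(residuals ** 2)))

# Score every candidate and pick the one with the lowest error.
errors = {name: rmse_of_linear_fit(f(x), y) for name, f in candidates.items()}
best = min(errors, key=errors.get)

for name, err in errors.items():
    print(f"{name}: RMSE = {err:.1f}")
print("lowest in-sample error:", best)
```

Keep in mind that a low in-sample error says nothing certain about extrapolation; holding out the last few points and checking the error there would be a more honest comparison.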
