41
votes

Scikit-learn utilizes a very convenient approach based on fit and predict methods. I have time-series data in the format suited for fit and predict.

For example I have the following Xs:

[[1.0, 2.3, 4.5], [6.7, 2.7, 1.2], ..., [3.2, 4.7, 1.1]]

and the corresponding ys:

[[1.0], [2.3], ..., [7.7]]

These data have the following meaning. The values stored in ys form a time series. The values in Xs are corresponding time dependent "factors" that are known to have some influence on the values in ys (for example: temperature, humidity and atmospheric pressure).

Now, of course, I can use fit(Xs,ys). But then I get a model in which future values in ys depend only on factors and do not dependend on the previous Y values (at least directly) and this is a limitation of the model. I would like to have a model in which Y_n depends also on Y_{n-1} and Y_{n-2} and so on. For example I might want to use an exponential moving average as a model. What is the most elegant way to do it in scikit-learn

ADDED

As it has been mentioned in the comments, I can extend Xs by adding ys. But this way has some limitations. For example, if I add the last 5 values of y as 5 new columns to X, the information about time ordering of ys is lost. For example, there is no indication in X that values in the 5th column follows value in the 4th column and so on. As a model, I might want to have a linear fit of the last five ys and use the found linear function to make a prediction. But if I have 5 values in 5 columns it is not so trivial.

ADDED 2

To make my problem even more clear, I would like to give one concrete example. I would like to have a "linear" model in which y_n = c + k1*x1 + k2*x2 + k3*x3 + k4*EMOV_n, where EMOV_n is just an exponential moving average. How, can I implement this simple model in scikit-learn?

2
you can simply add [np.nan] + ys[:-1] as one of the factors - behzad.nouri
behzad.nouri, I have extended my question as a reply to your comment. - Roman
I think the best thing to do is to try several different simple models, including what @behzad.nouri suggested, and use cross-validation to see if any one model performs any better than the others and gain an intuition. It might be that the time exact time information is not as important as you think. I would use the extended Xs vector idea in a neural network, and see if that worked. In that approach you could sort of keep your time information, even though the n.n. would be kind of a black box. - cjohnson318
Is your data hosted anywhere? Could I look at it? - cjohnson318
Unfortunately the date are not hosted, and the above example is just an example (the real data have a more complicated structure). - Roman

2 Answers

21
votes

This might be what you're looking for, with regard to the exponentially weighted moving average:

import pandas, numpy
ewma = pandas.stats.moments.ewma
EMOV_n = ewma( ys, com=2 )

Here, com is a parameter that you can read about here. Then you can combine EMOV_n to Xs, using something like:

Xs = numpy.vstack((Xs,EMOV_n))

And then you can look at various linear models, here, and do something like:

from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit ( Xs, ys )
print clf.coef_

Best of luck!

23
votes

According to Wikipedia, EWMA works well with stationary data, but it does not work as expected in the presence of trends, or seasonality. In those cases you should use a second or third order EWMA method, respectively. I decided to look at the pandas ewma function to see how it handled trends, and this is what I came up with:

import pandas, numpy as np
ewma = pandas.stats.moments.ewma

# make a hat function, and add noise
x = np.linspace(0,1,100)
x = np.hstack((x,x[::-1]))
x += np.random.normal( loc=0, scale=0.1, size=200 )
plot( x, alpha=0.4, label='Raw' )

# take EWMA in both directions with a smaller span term
fwd = ewma( x, span=15 )          # take EWMA in fwd direction
bwd = ewma( x[::-1], span=15 )    # take EWMA in bwd direction
c = np.vstack(( fwd, bwd[::-1] )) # lump fwd and bwd together
c = np.mean( c, axis=0 )          # average  

# regular EWMA, with bias against trend
plot( ewma( x, span=20 ), 'b', label='EWMA, span=20' )

# "corrected" (?) EWMA
plot( c, 'r', label='Reversed-Recombined' )

legend(loc=8)
savefig( 'ewma_correction.png', fmt='png', dpi=100 )

enter image description here

As you can see, the EWMA bucks the trend uphill and downhill. We can correct for this (without having to implement a second-order scheme ourselves) by taking the EWMA in both directions and then averaging. I hope your data was stationary!