cross_validation for time series in scikit learn machine learning

Question

I wasn't able to find the information I am looking for, so I will post my question here. I am just venturing into machine learning. I did my first multiple regression for a time series using the scikit-learn library. My code is shown below:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

# df is assumed to have a datetime index sorted in chronological order
X = df[feature_cols]
y = df[['scheduled_amount']]
linreg = LinearRegression()
tscv = TimeSeriesSplit(max_train_size=None, n_splits=11)
for train_index, test_index in tscv.split(X):
    # Positional slicing with .iloc keeps the datetime index on every subset
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    linreg.fit(X_train, y_train)
    y_predict = linreg.predict(X_test)
    # LinearRegression.score returns R^2, not the residual sum of squares
    print('R^2: ' + str(linreg.score(X_test, y_test)))
    # Copy first so the new column is not written into a slice of y
    y_test = y_test.copy()
    y_test['predicted_amount'] = y_predict.ravel()
    y_test.plot()

Note that my data is time series data and I want to keep the datetime index in my DataFrame when I'm fitting my model. I am using TimeSeriesSplit for cross-validation, but I still don't really understand cross-validation. First, is there a need for cross-validation in a time series dataset? Second, should I use the coefficients (coef_) from the last fold, or should I average the coefficients from all folds for my future predictions?

CV is just to check the performance. After you are satisfied with performance, you should train on the whole data and use that model. — Vivek Kumar
@Vivek Kumar: For time series, using the "whole data" may not necessarily be the best thing to do. If there are trends or concept shifts, for example, it can be better to train on a sliding window of constant size. But that really depends on the specific data or preprocessing steps. — Marcus V.

J63 · Accepted Answer · 2018-02-23T12:49:31

Yes, there is a need for cross-validation in a time series dataset. Basically, you need to ensure that your model does not overfit your current test set and is able to capture past seasonal changes, so you can have some confidence that it will do the same in the future. Cross-validation is also used to choose model hyperparameters (e.g. alpha in a Ridge regression).
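As a rough illustration of the hyperparameter point, here is a minimal sketch that passes a TimeSeriesSplit to GridSearchCV to pick alpha for a Ridge regression. The synthetic data, the alpha grid, and the scoring metric are all assumptions made for the example; they are not taken from the question.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic, time-ordered data (an assumption for illustration only).
rng = np.random.default_rng(0)
n_samples = 240
X = rng.normal(size=(n_samples, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=n_samples)

# Each split trains on earlier rows and validates on the rows that follow,
# so no future observation leaks into the training fold.
tscv = TimeSeriesSplit(n_splits=5)

# The candidate alpha values are arbitrary placeholders.
alphas = [0.01, 0.1, 1.0, 10.0]
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": alphas},
    cv=tscv,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
print("best alpha:", best_alpha)
print("mean CV MSE:", -search.best_score_)
```

With the default refit=True, search.best_estimator_ is already refitted on all of X and y using the selected alpha, so it can be used directly for prediction.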

To make future predictions, you should refit your regressor on the whole data with the best hyperparameters or, as @Marcus V. mentioned in the comments, it may be best to train it only on the most recent data.
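The two refitting options could look roughly like the sketch below. The synthetic data, the column names f1 and f2, the 90-row window, and the future feature rows are all hypothetical; in practice the window size would itself be something to validate.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic daily data with a datetime index (illustrative assumption).
rng = np.random.default_rng(1)
idx = pd.date_range("2017-01-01", periods=400, freq="D")
X = pd.DataFrame(rng.normal(size=(400, 2)), index=idx, columns=["f1", "f2"])
noise = rng.normal(scale=0.1, size=400)
y = (2.0 * X["f1"] - X["f2"] + noise).rename("scheduled_amount")

# Option 1: refit on the whole history.
full_model = LinearRegression().fit(X, y)

# Option 2: refit only on the most recent rows (hypothetical window size).
window_size = 90
recent_model = LinearRegression().fit(X.iloc[-window_size:], y.iloc[-window_size:])

# Hypothetical feature rows for the 7 days after the end of the history.
future_idx = pd.date_range(idx[-1] + pd.Timedelta(days=1), periods=7, freq="D")
X_future = pd.DataFrame(
    rng.normal(size=(7, 2)), index=future_idx, columns=["f1", "f2"]
)

# Predictions from both models, indexed by the future dates.
forecast = pd.DataFrame(
    {
        "full": full_model.predict(X_future),
        "recent": recent_model.predict(X_future),
    },
    index=future_idx,
)
print(forecast)
```

To compare the two options before committing to one, the sliding-window variant can be cross-validated by setting max_train_size on TimeSeriesSplit, which caps the number of training rows in each fold.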
