11
votes

I'm trying to solve a machine learning problem on a dataset with a time-series element. For this problem I'm using the well-known Python library sklearn. The library provides a lot of cross-validation iterators, as well as several iterators for defining the cross-validation yourself. The problem is that I don't really know how to define a simple cross-validation for time series. Here is a good example of what I'm trying to get:

Suppose we have several periods (years) and we want to split our data set into several chunks as follows:

data = [1, 2, 3, 4, 5, 6, 7]

train: [1]                test: [2] (or test: [2, 3, 4, 5, 6, 7])
train: [1, 2]             test: [3] (or test: [3, 4, 5, 6, 7])
train: [1, 2, 3]          test: [4] (or test: [4, 5, 6, 7])
...
train: [1, 2, 3, 4, 5, 6] test: [7]

I can't really figure out how to create this kind of cross-validation with the sklearn tools. Probably I should use PredefinedSplit from sklearn.cross_validation, like this:

from sklearn import cross_validation

train_fraction  = 0.8
train_size      = int(train_fraction * X_train.shape[0])  # X_train is my training feature matrix
validation_size = X_train.shape[0] - train_size

cv_split = cross_validation.PredefinedSplit(test_fold=[-1] * train_size + [1] * validation_size)

Result:

train: [1, 2, 3, 4, 5] test: [6, 7]

But it's still not as good as the previous data split.
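
For reference, here is a minimal self-contained version of that attempt (I'm using the newer sklearn.model_selection API here instead of sklearn.cross_validation; the -1 entries are always kept in the training set, which is why this only ever produces one split):

import numpy as np
from sklearn.model_selection import PredefinedSplit

data = np.array([1, 2, 3, 4, 5, 6, 7])
X = data.reshape(-1, 1)  # dummy feature matrix, only needed to call split()

# -1 means "never in the test set"; the samples labelled 1 form the single test fold.
ps = PredefinedSplit(test_fold=[-1, -1, -1, -1, -1, 1, 1])
for train_idx, test_idx in ps.split(X):
    print(data[train_idx], data[test_idx])  # -> [1 2 3 4 5] [6 7]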

What are the variables in your data set? Why is it important to split by time; why not just split randomly? – maxymoo
You could generate the splits without the use of scikit-learn, as follows: cv_split = [(data[:i], data[i:]) for i in range(1, len(data))]. What do you think? – Dan Oneață
@maxymoo The reason not to split randomly with time series data is that time might matter (not just the other features you've identified), but "in the wild" you never get to train your model on data from the future. So in testing your model, you should behave similarly and not train on data from after the test date(s). – dslack
@DanOneață I'm sorry that I did not mention this in the question, but after creating the PredefinedSplit I pass it into RFECV, which requires a cross-validation generator or an iterable yielding train/test splits. So I thought maybe I could solve the problem with sklearn tools. – Demyanov
@Demyanov But cv_split as I've defined it above is an iterable yielding train/test splits, if we consider data to be the indices of the data. – Dan Oneață

2 Answers

6
votes

You can obtain the desired cross-validation splits without using sklearn. Here's an example:

import numpy as np

from sklearn.svm import SVR
from sklearn.feature_selection import RFECV

# Generate some data.
N = 10
X_train = np.random.randn(N, 3)
y_train = np.random.randn(N)

# Define the splits.
idxs = np.arange(N)
cv_splits = [(idxs[:i], idxs[i:]) for i in range(1, N)]

# Create the RFE object and compute a cross-validated score.
svr = SVR(kernel="linear")
rfecv = RFECV(estimator=svr, step=1, cv=cv_splits)
rfecv.fit(X_train, y_train)
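
After fitting, you can inspect the usual RFECV attributes to see what the expanding-window cross-validation selected:

print(rfecv.support_)     # boolean mask of the selected features
print(rfecv.ranking_)     # 1 for selected features; larger values were eliminated earlier
print(rfecv.n_features_)  # number of features chosen by the cross-validation
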
4
votes

In the meantime, this has been added to the library: http://scikit-learn.org/stable/modules/cross_validation.html#time-series-split

Example from the doc:

>>> import numpy as np
>>> from sklearn.model_selection import TimeSeriesSplit

>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([1, 2, 3, 4, 5, 6])
>>> tscv = TimeSeriesSplit(n_splits=3)
>>> print(tscv)  
TimeSeriesSplit(n_splits=3)
>>> for train, test in tscv.split(X):
...     print("%s %s" % (train, test))
[0 1 2] [3]
[0 1 2 3] [4]
[0 1 2 3 4] [5]
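
Since the question ultimately needs to feed the splits into RFECV, the splitter can be passed directly as the cv argument. A rough sketch, reusing X and y from the snippet above (scoring is set to neg_mean_absolute_error only because the default R² score is not defined on single-sample test sets):

from sklearn.svm import SVR
from sklearn.feature_selection import RFECV

svr = SVR(kernel="linear")
rfecv = RFECV(estimator=svr, step=1, cv=TimeSeriesSplit(n_splits=3),
              scoring="neg_mean_absolute_error")
rfecv.fit(X, y)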