I'm trying to solve a machine learning problem. I have a dataset with a time-series element. For this problem I'm using the well-known Python library sklearn. It has a lot of cross-validation iterators, as well as several iterators for defining cross-validation yourself. The problem is that I don't really know how to define a simple cross-validation for time series. Here is an example of what I'm trying to get:
Suppose we have several periods (years) and we want to split our data set into several chunks as follows:
data = [1, 2, 3, 4, 5, 6, 7]
train: [1] test: [2] (or test: [2, 3, 4, 5, 6, 7])
train: [1, 2] test: [3] (or test: [3, 4, 5, 6, 7])
train: [1, 2, 3] test: [4] (or test: [4, 5, 6, 7])
...
train: [1, 2, 3, 4, 5, 6] test: [7]
I can't really understand how to create this kind of cross-validation with sklearn tools. Probably I should use PredefinedSplit from sklearn.cross_validation, like this:
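from sklearn import cross_validation

# X_train is the feature matrix, with rows ordered by time
train_fraction = 0.8
train_size = int(train_fraction * X_train.shape[0])
validation_size = X_train.shape[0] - train_size
# -1 marks samples that are always in the training set; 1 marks the single test fold
cv_split = cross_validation.PredefinedSplit(test_fold=[-1] * train_size + [1] * validation_size)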
Result:
train: [1, 2, 3, 4, 5] test: [6, 7]
But this is still not as good as the data split described above.
cv_split = [(data[:i], data[i:]) for i in range(1, len(data))]
What do you think? – Dan Oneață

I put the PredefinedSplit into RFECV, which requires a cross-validation generator or an iterable yielding train/test splits. So I thought maybe I could solve the problem with sklearn tools. – Demyanov

cv_split as I've defined it above is an iterable yielding train/test splits, if we consider data to be the indices of the data. – Dan Oneață
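
Following up on the last comment, here is a minimal sketch of the index-based version of that split. The function name expanding_window_splits and its parameters min_train_size and single_step_test are made up for illustration and are not part of sklearn. It yields (train_indices, test_indices) pairs with a growing training window, which is the kind of iterable the comment describes.

```python
def expanding_window_splits(n_samples, min_train_size=1, single_step_test=False):
    """Yield (train_indices, test_indices) pairs with a growing training window.

    Hypothetical helper for illustration; not an sklearn API.
    """
    for i in range(min_train_size, n_samples):
        train = list(range(i))
        # Test on the next period only, or on all remaining periods.
        test = [i] if single_step_test else list(range(i, n_samples))
        yield train, test


data = [1, 2, 3, 4, 5, 6, 7]
splits = list(expanding_window_splits(len(data)))

for train, test in splits:
    # Map positions back to the period labels from the question.
    print("train:", [data[j] for j in train], "test:", [data[j] for j in test])
```

Since the result is a plain list of index pairs, it can presumably be passed wherever an iterable of train/test splits is accepted. Newer sklearn versions also ship TimeSeriesSplit in sklearn.model_selection, which produces similar expanding-window splits and may be worth checking before hand-rolling one.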