I am using the TimeSeriesSplit class from sklearn to create train and test sets for cross-validating a time series. The idea is, for instance, to use the first n-1 data points for training and the n-th data point for testing. The split must always preserve the order, since it is a time series. However, I don't understand why the dataset X in the example is formatted as follows:

from sklearn.model_selection import TimeSeriesSplit
import numpy as np
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
tscv = TimeSeriesSplit(n_splits=3)
print(tscv)  
for train_index, test_index in tscv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

What is the logic behind preparing the data as X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])? I have read the notes on the documentation page, but I still don't understand it.

Can you be more specific about what exactly you don't understand in the structure of the X array? - Ferran Parés

1 Answer

Typically, with time series data you want to predict y[t] from the earlier observations X[0], ..., X[t-1]. The split method of sklearn.model_selection.TimeSeriesSplit takes a single complete time series X with N rows (where N is the number of instances, one per time step) and, optionally, the corresponding labels y for each time step. X therefore has shape (4, 2) because it holds four instances at different time steps, and each instance has 2 features.
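As a rough illustration of the shape argument above, the sketch below compares the splits produced for the original two-feature X with those for a one-feature slice of it. The name X_one_feature is made up for this example; the point is only that the indices depend on the number of rows, not on the number of columns.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Four time steps (rows), two features per time step (columns).
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
print(X.shape)  # (n_samples, n_features)

# Hypothetical variant for comparison: keep only the first feature column.
X_one_feature = X[:, :1]

tscv = TimeSeriesSplit(n_splits=3)

# Collect the index pairs as plain lists so they are easy to compare.
splits_two = [(tr.tolist(), te.tolist()) for tr, te in tscv.split(X)]
splits_one = [(tr.tolist(), te.tolist()) for tr, te in tscv.split(X_one_feature)]

# The splitter only looks at how many rows there are.
print(splits_two == splits_one)
```

So the two columns in the documentation example are just placeholder features; any array with four rows would yield the same train and test indices.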

How to interpret these two features is open to debate:

  1. We can consider each instance to be a single sample at a specific point in time, with a set of features. Or...
  2. We can consider each instance to be a set of points in time that defines the instance itself over a time interval.

Both options seem correct to me. However we interpret the structure of X, what matters here is how TimeSeriesSplit splits the data: it never tests on an instance from a time step earlier than any of the training instances.
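A minimal sketch of that ordering guarantee, using the same X and n_splits=3 as in the question. The list name folds is introduced here only for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])

tscv = TimeSeriesSplit(n_splits=3)

folds = []  # illustrative container for the (train, test) index lists
for train_index, test_index in tscv.split(X):
    # Every training index comes strictly before every test index,
    # so the model is never evaluated on data older than its training data.
    assert train_index.max() < test_index.min()
    folds.append((train_index.tolist(), test_index.tolist()))
    print("TRAIN:", train_index, "TEST:", test_index)
```

With four rows and three splits, the training window grows by one time step per fold while the test set is always the single next time step.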