Unbalanced Panel data: How to use Time Series Splits Cross-Validation?

Question

I am working with a large unbalanced panel data set and was wondering whether sklearn's TimeSeriesSplit cross-validator can split my training sample into several 'folds'. I want each fold to contain only the cross-sectional observations that fall within that fold's timeframe.

The data set is an unbalanced panel stored with a pandas MultiIndex. Here is a reproducible example to provide some more intuition:

import numpy as np
import pandas as pd

# Level 0: cross-sectional unit, level 1: month (as strings)
arrays = [np.array(['A', 'A', 'A', 'B', 'B', 'C', 'C', 'D', 'D', 'D', 'D']),
          np.array(['2000-01', '2000-02', '2000-03', '1999-12', '2000-01',
                    '2000-01', '2000-02', '1999-12', '2000-01', '2000-02', '2000-03'])]

s = pd.DataFrame(np.random.randn(11, 4), index=arrays)

This gives a DataFrame with 11 rows and 4 columns, indexed by unit (A to D) and month.

For instance, I would like to start with all cross-sectional units in 1999-12 as the training sample and all cross-sectional units in 2000-01 as the validation sample. Next, I want all cross-sectional units in 1999-12 and 2000-01 as training, and all cross-sectional units in 2000-02 as validation, and so forth. Is this possible with TimeSeriesSplit, or do I need to look somewhere else?

Charles Landau · Accepted Answer · 2019-05-29T15:23:41

TimeSeriesSplit is a variation of KFold that ensures ascending index values across each successive fold. As noted in the docs:

In each split, test indices must be higher than before... [also] note that unlike standard cross-validation methods, successive training sets are supersets of those that come before them.

docs

Also remember that KFold and TimeSeriesSplit return indices. You already have the index that you want.
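To make the point about indices concrete, here is a minimal sketch on a toy array of four time-ordered observations. The array `X` and the choice of `n_splits=3` are illustrative assumptions, not part of the question's data. It shows that `split` yields positional integer indices and that each training set contains the previous one.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Four time-ordered observations; the values themselves are arbitrary
X = np.arange(4).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)

# split() yields (train, test) arrays of row positions, not labels or dates
splits = [(train.tolist(), test.tolist()) for train, test in tscv.split(X)]

for train, test in splits:
    print("train:", train, "test:", test)
```

Note that these are positions in the row order of `X`. On a panel with several rows per date, applying this directly to the rows would cut through dates, which is why the rest of this answer selects rows by date instead.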

One issue is that slicing on a DatetimeIndex level inside a MultiIndex is awkward. See here, here and here. Since you're extracting the data at this point anyway, resetting the index and slicing on the resulting column seems acceptable, especially since reset_index returns a new DataFrame rather than modifying the original in place.
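For comparison, if you would rather keep the MultiIndex, one option is a boolean mask built from `get_level_values`. This is only a sketch: the level names `unit` and `date` and the column `x` are made up for illustration, since the question's index levels are unnamed.

```python
import pandas as pd

units = ["A", "A", "B", "B"]
dates = pd.to_datetime(["2000-01", "2000-02", "1999-12", "2000-01"], format="%Y-%m")

# Hypothetical level names; the question's index has no names
index = pd.MultiIndex.from_arrays([units, dates], names=["unit", "date"])
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]}, index=index)

# Pull out the date level once and compare it like an ordinary DatetimeIndex
date_level = df.index.get_level_values("date")
up_to_jan = df[date_level <= pd.Timestamp("2000-01-01")]
only_feb = df[date_level == pd.Timestamp("2000-02-01")]
```

The filtered frames keep their MultiIndex, so no columns need to be dropped afterwards.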

Finally, I recommend casting the date strings in the index to an actual datetime data type.

import pandas as pd
import numpy as np
import datetime
arrays = [np.array(['A', 'A', 'A', 'B', 'B', 'C', 'C', 'D', 'D', 'D', 'D']),
           np.array(['2000-01', '2000-02', '2000-03', '1999-12', '2000-01', 
          '2000-01', '2000-02', '1999-12', '2000-01', '2000-02', '2000-03'])]

# Cast as datetime
arrays[1] = pd.to_datetime(arrays[1])


df = pd.DataFrame(np.random.randn(11, 4), index=arrays)
df = df.sort_index()  # sort_index returns a new DataFrame, so assign it back


folds = df.reset_index()  # returns a copy; df still has its MultiIndex after this

# The unnamed index levels become the columns "level_0" (unit) and "level_1" (date).
# You can tack an .iloc[:, 2:] onto the end of these lines for just the values.
# Use your predefined conditions to select rows by date.
fold1 = folds[folds["level_1"] <= datetime.datetime(2000, 1, 1)]
fold2 = folds[folds["level_1"] == datetime.datetime(2000, 2, 1)]
fold3 = folds[folds["level_1"] == datetime.datetime(2000, 3, 1)]
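If you want the exact scheme from the question (everything before a month as training, that month as validation), the hand-written conditions above can be generalized with a small generator. This is a sketch, not a scikit-learn API: the helper name `expanding_panel_splits` is hypothetical, and it assumes one date per row in the same order as the data.

```python
import numpy as np
import pandas as pd

def expanding_panel_splits(dates):
    """Yield (train_idx, val_idx) arrays of row positions, one pair per period.

    Each validation set is every row in one period; the matching training
    set is every row from strictly earlier periods.
    """
    idx = pd.DatetimeIndex(dates)
    periods = idx.unique().sort_values()
    # The earliest period has nothing before it, so it is never validated
    for val_period in periods[1:]:
        train_idx = np.flatnonzero(idx < val_period)
        val_idx = np.flatnonzero(idx == val_period)
        yield train_idx, val_idx

# Dates from the question, in row order
dates = pd.to_datetime(
    ["2000-01", "2000-02", "2000-03", "1999-12", "2000-01", "2000-01",
     "2000-02", "1999-12", "2000-01", "2000-02", "2000-03"],
    format="%Y-%m",
)

splits = list(expanding_panel_splits(dates))
for train_idx, val_idx in splits:
    print("train rows:", train_idx.tolist(), "validation rows:", val_idx.tolist())
```

Because the pairs are positional index arrays, the resulting list can be passed as the `cv` argument of scikit-learn utilities such as `cross_val_score`, which accept an iterable of train/test index pairs, provided the rows of `X` are in the same order as `dates`.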
