I am currently working with a large unbalanced panel data set and was wondering whether scikit-learn's TimeSeriesSplit cross-validator can be used to split my training sample into several folds. Each fold should contain only the cross-sectional observations that fall within that fold's time frame.
The data set is an unbalanced panel stored in a pandas DataFrame with a MultiIndex (cross-sectional unit, month). Here is a reproducible example to provide some more intuition:
import numpy as np
import pandas as pd

# Index level 0: cross-sectional unit; level 1: month (not every unit has every month)
arrays = [np.array(['A', 'A', 'A', 'B', 'B', 'C', 'C', 'D', 'D', 'D', 'D']),
          np.array(['2000-01', '2000-02', '2000-03', '1999-12', '2000-01',
                    '2000-01', '2000-02', '1999-12', '2000-01', '2000-02', '2000-03'])]
s = pd.DataFrame(np.random.randn(11, 4), index=arrays)
For instance, I would like to start with all cross-sectional units in 1999-12 as the training sample and all cross-sectional units in 2000-01 as the validation sample. Next, I want all cross-sectional units in 1999-12 and 2000-01 as training and all cross-sectional units in 2000-02 as validation, and so forth. Is this possible with the TimeSeriesSplit class, or do I need to look elsewhere?
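
To make the desired scheme concrete, here is a minimal sketch of one way it might be expressed: run TimeSeriesSplit over the unique months rather than over the rows, then map each month split back to row positions. The helper name `panel_time_series_split` and its `date_level` parameter are hypothetical and not part of scikit-learn. The sketch assumes the month labels sort chronologically as strings (true for the 'YYYY-MM' format) and that the installed scikit-learn version supports the `test_size` argument of TimeSeriesSplit.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit


def panel_time_series_split(df, n_splits, date_level=1):
    """Hypothetical helper: yield (train_pos, val_pos) row positions.

    The split is made on the unique periods found in index level
    `date_level`, so every row belonging to a period lands in the same fold.
    """
    dates = df.index.get_level_values(date_level)
    # Assumption: labels such as '2000-01' sort chronologically as strings.
    unique_dates = np.array(sorted(dates.unique()))
    # One validation period per fold, with an expanding training window.
    tscv = TimeSeriesSplit(n_splits=n_splits, test_size=1)
    for train_idx, val_idx in tscv.split(unique_dates):
        train_mask = dates.isin(unique_dates[train_idx])
        val_mask = dates.isin(unique_dates[val_idx])
        yield np.flatnonzero(train_mask), np.flatnonzero(val_mask)


# Same unbalanced panel as in the example above.
arrays = [np.array(['A', 'A', 'A', 'B', 'B', 'C', 'C', 'D', 'D', 'D', 'D']),
          np.array(['2000-01', '2000-02', '2000-03', '1999-12', '2000-01',
                    '2000-01', '2000-02', '1999-12', '2000-01', '2000-02',
                    '2000-03'])]
s = pd.DataFrame(np.random.randn(11, 4), index=arrays)

# Four unique months allow at most three expanding-window folds.
for train_pos, val_pos in panel_time_series_split(s, n_splits=3):
    train_months = sorted(s.index[train_pos].get_level_values(1).unique())
    val_months = sorted(s.index[val_pos].get_level_values(1).unique())
    print("train:", train_months, "validation:", val_months)
```

Because the helper yields positional index arrays, its output could in principle be materialised with `list(...)` and passed as the `cv` argument of scikit-learn utilities that accept an iterable of (train, test) splits, though I have not verified this on the full data set.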