1
votes

I'm working on a project in which I have combined two time-series datasets (D1 and D2). D1 had a 5-minute interval and D2 a 1-minute interval, so I resampled D1 to a 1-minute interval and combined it with D2. Now I want to split this new dataset, D1D2, into train, test and valid sets based on these conditions:

Note: I have searched a lot for a solution but couldn't find an answer that fits my question, so please don't mark this as a duplicate!

  1. The valid set should be the last 60 values of the dataset.
  2. The test set should be the most recent values before the valid set.
  3. The train set should be the remaining data.

Here's how I'm doing the split now:

def split_train_test(dataset, train_size, test_size):
    train = dataset[:train_size, :]
    test = dataset[test_size:, :]
    # split into input and outputs
    train_X, train_y = train[:, :-1], train[:, -1]
    test_X, test_y = test[:, :-1], test[:, -1]
    # reshape input to be 3D [samples, timesteps, features]
    train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
    test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
    print(train_X.shape, train_y.shape, test_X.shape)
    return train, test, train_X, train_y, test_X, test_y

But now I need to split the data into train, test and valid sets based on the above conditions.

How can I do that? Also, is this the right way to split time-series datasets?

You can select rows by counting from the end: train_df = df[:-60, :] - pissall
So it will give me the last 60 records for the valid set, but how can I split the remaining records into train and test? - Abdul Rehman
I have mentioned 3 conditions above in the question. - Abdul Rehman
What does "Then, the test set should be the most recent values till to the valid set" mean? - pissall
It means we take the last 60 values as the valid set; that is, the test set should be the most recent values that remain after leaving out the last 60 records of the dataset. - Abdul Rehman

1 Answer

2
votes

Try this:

valid_set = dataset.iloc[-60:, :]
test_set = dataset.iloc[-120:-60]
train_set = dataset.iloc[:-120]
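
The code in the question indexes its data as dataset[:train_size, :], which suggests a NumPy array rather than a DataFrame, and .iloc does not exist on NumPy arrays. A minimal sketch of the same three-way split with plain slicing, assuming that is the case. The function name split_tail and the parameter test_size are hypothetical; the value 60 for the test set just mirrors the snippet above.

```python
import numpy as np

def split_tail(dataset, valid_size=60, test_size=60):
    """Chronological split: oldest rows -> train, then test, newest rows -> valid."""
    n = len(dataset)
    if valid_size + test_size >= n:
        raise ValueError("valid_size + test_size must be smaller than the dataset")
    # Explicit start indices avoid the pitfall that a[:-0] is an empty slice
    valid_start = n - valid_size
    test_start = valid_start - test_size
    train = dataset[:test_start]
    test = dataset[test_start:valid_start]
    valid = dataset[valid_start:]
    return train, test, valid

# Toy data: 300 rows, 3 columns (a made-up stand-in for D1D2)
data = np.arange(900).reshape(300, 3)
train, test, valid = split_tail(data)
print(train.shape, test.shape, valid.shape)  # (180, 3) (60, 3) (60, 3)
```

The rows are never shuffled, so every test row is newer than every train row and every valid row is newer than every test row.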

To generalize:

def split_train_test(dataset, validation_size):
    # the last `validation_size` rows form the validation set
    valid = dataset.iloc[-validation_size:, :]
    # everything before them is shared between train and test
    train_test = dataset.iloc[:-validation_size]

    # the first 63% of the remaining rows go to train, the rest to test
    train_length = int(0.63 * len(train_test))

    # split into inputs (all columns but the last) and outputs (last column)
    train_X, train_y = train_test.iloc[:train_length, :-1], train_test.iloc[:train_length, -1]
    test_X, test_y = train_test.iloc[train_length:, :-1], train_test.iloc[train_length:, -1]
    valid_X, valid_y = valid.iloc[:, :-1], valid.iloc[:, -1]

    return train_test, valid, train_X, train_y, test_X, test_y, valid_X, valid_y

You can pass the split ratio into the function as a parameter rather than hardcoding it as I have.
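
A minimal sketch of that parameterized version, assuming dataset is a pandas DataFrame whose last column is the target. The names split_train_test_valid, train_ratio and to_xy are hypothetical, chosen only for illustration.

```python
import pandas as pd

def split_train_test_valid(dataset, validation_size=60, train_ratio=0.63):
    """Split chronologically: train first, then test, then the newest rows as valid."""
    n_rest = len(dataset) - validation_size  # rows left for train + test
    if validation_size <= 0 or n_rest <= 0:
        raise ValueError("validation_size must be between 1 and len(dataset) - 1")
    if not 0 < train_ratio < 1:
        raise ValueError("train_ratio must be strictly between 0 and 1")
    train_length = int(train_ratio * n_rest)
    train = dataset.iloc[:train_length]
    test = dataset.iloc[train_length:n_rest]
    valid = dataset.iloc[n_rest:]
    return train, test, valid

def to_xy(frame):
    """Separate the inputs (all columns but the last) from the output (last column)."""
    return frame.iloc[:, :-1], frame.iloc[:, -1]

# Toy frame with two feature columns and one target column (made-up data)
df = pd.DataFrame({"f1": range(1000), "f2": range(1000), "target": range(1000)})
train, test, valid = split_train_test_valid(df, validation_size=60, train_ratio=0.8)
train_X, train_y = to_xy(train)
print(len(train), len(test), len(valid))  # 752 188 60
```

Returning the three frames and separating inputs from outputs in a small helper keeps the return value short; the helper can then be applied to test and valid in the same way.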