I have a dataset with 100k rows, each a store-item pair (e.g. (store 1, item 190)), and 300 columns, each a date (e.g. 2017-01-01, 2017-01-02, 2017-01-03, ...). The values are sales.

I am trying to use a Keras LSTM to predict future sales. How can I construct my train and validation dataset?

I am thinking of splitting train and validation along the time axis, as data[:, :n_days] and data[:, n_days:], so that both datasets have the same number of samples (100k). Is this the wrong way to think about it?

If this is the right approach, how should I choose n_days? Should the train and validation datasets have exactly the same dimensions (for example, n_days = 150, with 149 days used to predict 1 day)?

1 Answer

how can I construct my train and validation dataset?

I am not sure it is a strict rule of thumb, but a common approach is to split your dataset into roughly 80% training and 20% validation; in your case that would be approximately 80k and 20k rows. The exact percentages vary, but ratios in that neighborhood are commonly recommended. Ideally you would also keep a test dataset, one never used during training or validation, to evaluate the final performance of your models.
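
As a rough sketch of how such a split could look with a held-out test set, the snippet below shuffles the row indices with NumPy and cuts them into three parts. The function name three_way_split, the 10% test fraction, and the random placeholder data (1,000 rows instead of 100k) are assumptions made for illustration, not something taken from the question.

```python
import numpy as np

def three_way_split(data, test_fraction=0.1, val_fraction=0.2, seed=0):
    """Split rows into train, validation and test parts (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(data))  # shuffled row indices

    # Set the test rows aside first, then take the validation rows
    # from what is left, so the three parts never overlap.
    n_test = int(round(len(data) * test_fraction))
    n_val = int(round((len(data) - n_test) * val_fraction))

    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return data[train_idx], data[val_idx], data[test_idx]

# Placeholder data: 1,000 store-item rows by 300 daily sales values.
data = np.random.default_rng(42).poisson(lam=5.0, size=(1000, 300)).astype("float32")

train, val, test = three_way_split(data)
print(train.shape, val.shape, test.shape)
```

With these assumed fractions, 10% of the rows go to the test set and the remainder is divided roughly 80/20 between training and validation.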

Now, regarding the shape of your data, it is important to recall what the Keras docs on recurrent layers say:

Input shape

3D tensor with shape (batch_size, timesteps, input_dim).

Defining this shape depends on the nature of your problem. You mention you want to predict sales, so this can be phrased as a regression problem. You also mention your data consists of 300 columns that make up your time series, and naturally you have the real sales value for each of those rows.

In this case, with a batch size of 1, your input shape would be (1, 300, 1). That means you are training on batches of 1 element (so the weights are updated after every sample), where each element has 300 time steps and 1 feature per time step.
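
To make the shape concrete, here is a minimal sketch assuming the sales matrix is held in a 2D NumPy array of shape (rows, days). The array name sales and the all-zero placeholder values are assumptions for illustration only.

```python
import numpy as np

# Placeholder for the sales matrix: 5 store-item rows by 300 days.
sales = np.zeros((5, 300), dtype="float32")

# Add a trailing feature axis: (samples, timesteps) -> (samples, timesteps, 1).
sales_3d = sales[..., np.newaxis]
print(sales_3d.shape)

# Slicing out one sample keeps all three axes, giving one batch of size 1.
one_batch = sales_3d[:1]
print(one_batch.shape)
```

The trailing axis of size 1 is the single feature (the sales value) per time step; slicing with [:1] rather than [0] keeps the batch axis in place.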

For splitting your data, one option that has helped me before is the train_test_split function from scikit-learn, where you simply pass your data and labels and indicate the ratio you want:

# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
# Hold out 15% of the rows as the validation split
X, X_val, Y, Y_val = train_test_split(data, labels, test_size=0.15)