5
votes

I am trying to build a predictive model on stock prices. From what I've read, LSTM is a good layer to use. I can't fully understand what my input_shape needs to be for my model though.

Here is the tail of my DataFrame:

(df.tail() output was shown as an image)

I then split the data into train / test

labels = df['close'].values
x_train_df = df.drop(columns=['close'])
x_train, x_test, y_train, y_test = train_test_split(x_train_df.values, labels, test_size=0.2, shuffle=False)

min_max_scaler = MinMaxScaler()
x_train = min_max_scaler.fit_transform(x_train)
x_test = min_max_scaler.transform(x_test)

print('y_train', y_train.shape)
print('y_test', y_test.shape)
print('x_train', x_train.shape)
print('x_test', x_test.shape)
print(x_train)

This yields:

(printed shapes shown as an image; per the error below, x_train has shape (4026, 5))

Here's where I am getting confused. Running a simple example model, I get the following error:

ValueError: Input 0 of layer lstm_15 is incompatible with the layer: expected ndim=3, found ndim=4. Full shape received: [None, 1, 4026, 5]

I've tried various combinations for the input_shape and have come to the conclusion that I have no idea how to determine it.

model = Sequential()
model.add(LSTM(32, input_shape=(1, x_train.shape[0], x_train.shape[1])))

model.compile(optimizer='adam', loss='mse', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=10)

Given my DataFrame, what should my input_shape be? I understand that the input shape is (batch size, timesteps, data dim), but I'm not clear on how to map those terms to my actual data, since what I thought the values were turns out to be wrong.

I was thinking:

  • Batch Size: Number of records I'm passing in (4026)
  • Time Steps: 1 (I'm not sure if this is supposed to be the same value as batch size?)
  • Data Dimension: 1 since my data is 1 dimensional (I think?)
Comment: "here the fastest way to create data for LSTM/RNN: stackoverflow.com/a/62570576/10375049" – Marco Cerliani

1 Answer

11
votes

First of all, I don't think that you need an LSTM at all. Based on the df.tail(), it appears to me that there is no temporal dependence between the rows of the pandas DataFrame (the samples of your dataset). Anyway, I will come back to that later; first, your questions:

  1. Batch size: The number of elements in the batch. In total, the dataset contains 4026 elements. On the other hand, the batch size is the number of elements that are processed in a single batch. Let us assume that it is 2. In that case you will have 2013 of these batches.

  2. Time steps: The number of samples that have a temporal dependence between them. Assuming that every 3 instances in your dataset constitute a sequence, the time steps would be 3. Each sample in the dataset then consists of 3 measurements, so the total number of sequences is 1342 (4026 / 3).

  3. Data dimension: The number of features for each element in the batch, for each time step - in your case 5, assuming that buy is the label and date is the temporal dependence column.
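
To make point 2 concrete: before the data reaches an LSTM, the flat 2-D array has to be reshaped into (sequences, timesteps, features). A minimal NumPy sketch, assuming non-overlapping windows of 3 timesteps over a (4026, 5) array (the variable names here are hypothetical):

```python
import numpy as np

# Hypothetical flat dataset: 4026 rows, 5 features each
flat = np.random.rand(4026, 5)

timesteps = 3
n_seq = flat.shape[0] // timesteps  # 4026 // 3 = 1342 full sequences

# Drop any trailing remainder, then reshape into (sequences, timesteps, features)
windowed = flat[:n_seq * timesteps].reshape(n_seq, timesteps, flat.shape[1])
print(windowed.shape)  # (1342, 3, 5)
```

Overlapping (sliding) windows are also common for time series; the non-overlapping reshape above is just the simplest way to get the 3-D shape the LSTM expects.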

As a result, the shape of a single batch of data should be (2, 3, 5), while the shape of the whole dataset would be (1342, 3, 5). Please note that these shapes are valid only if you use an LSTM and there is a temporal dependence between each of the time steps. Here is an example code snippet to verify some of this:

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Random training data: 1342 sequences of 3 timesteps, 5 features each
x_train = np.random.rand(1342, 3, 5)
# Random binary training labels
y_train = np.random.randint(0, 2, 1342)

model = Sequential()
model.add(LSTM(32, input_shape=(3, 5)))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=2, batch_size=32)

Now, back to my previous point. Looking at df.tail(), it seems to me that there is no temporal dependence whatsoever between the samples in the dataset. With that said, I would first convert the date column to something meaningful (a one-hot encoding of the month of the year, a one-hot encoding of the 4 seasons depending on the month, etc.). Then, instead of constructing an RNN, I would proceed with a feed-forward neural network with a binary classification output layer.
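
For the date conversion, a minimal sketch (assuming a pandas datetime column named date; the column names here are hypothetical) using pd.get_dummies on the extracted month:

```python
import pandas as pd

# Hypothetical frame with a datetime 'date' column
df = pd.DataFrame({'date': pd.to_datetime(['2020-01-15', '2020-06-30', '2020-12-01'])})

# One-hot encode the month number (1-12) extracted from the date
month_dummies = pd.get_dummies(df['date'].dt.month, prefix='month')
df = pd.concat([df.drop(columns=['date']), month_dummies], axis=1)
print(df.columns.tolist())  # ['month_1', 'month_6', 'month_12']
```

The same pattern works for seasons: map the month to a season label first, then one-hot encode that.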

As for the model, once you take care of all data related stuff, something as simple as this should work for you:

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Random training data: 4026 samples with 5 features each
x_train = np.random.rand(4026, 5)
# Random binary training labels
y_train = np.random.randint(0, 2, 4026)

model = Sequential()
model.add(Dense(32, activation='relu', input_dim=5))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=2, batch_size=32)
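
One caveat, since the question's labels are continuous close prices: if you stay with regression instead of a binary buy/no-buy target, the output layer would be linear and the loss mse (as in the question's original compile call). A sketch under that assumption:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Random training data: 4026 samples with 5 features each
x_train = np.random.rand(4026, 5)
# Random continuous labels standing in for 'close' prices
y_train = np.random.rand(4026)

model = Sequential()
model.add(Dense(32, activation='relu', input_dim=5))
model.add(Dense(1))  # linear output for regression
model.compile(optimizer='adam', loss='mse')
model.fit(x_train, y_train, epochs=1, batch_size=32, verbose=0)
```

Note that accuracy is not a meaningful metric for regression, which is another sign that the choice between classification and regression should be made before picking the loss and metrics.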

Hope that it helps!