1
votes

For my master thesis, I want to predict the price of a stock in the next hour using a LSTM model. My X data contains 30.000 rows with 6 dimensions (= 6 features), my Y data contains 30.000 rows and only 1 dimension (=target variable). For my first LSTM model, I reshaped the X data to (30.000x1x6), the Y data to (30.000x1) and determined the input like this: input_nn = Input(shape=(1, 6))

I am not sure how to reshape the data and to determine the input shape for the model if I want to increase the timesteps. I still want to predict the stock price in the next hour, but include more previous time steps. Do I have to add the data from previous timesteps in my X data in the second dimension?

Can you explain what the number of units of a LSTM exactly refers to? Should it be the same as the number of timesteps in my case?

2

2 Answers

1
votes

You are on the right track but confusing the number of units with timesteps. The units is a hyper-parameter that controls the output dimension of the LSTM. It is the dimension of the LSTM output vector, so if input is (1,6) and you have 32 units you will get (32,) as in the LSTM will traverse the single timestep and produce a vector of size 32.

Timesteps refers to the size of the history you can your LSTM to consider. So it isn't the same as units at all. Instead of processing the data yourself, Keras has a handy TimeseriesGenerator which will take a 2D data like yours and use a sliding window of some timestep size to generate timeseries data. From the documentation:

from keras.preprocessing.sequence import TimeseriesGenerator
import numpy as np

data = np.array([[i] for i in range(50)])
targets = np.array([[i] for i in range(50)])

data_gen = TimeseriesGenerator(data, targets,
                               length=10, sampling_rate=2,
                               batch_size=2)
assert len(data_gen) == 20

batch_0 = data_gen[0]
x, y = batch_0
assert np.array_equal(x,
                      np.array([[[0], [2], [4], [6], [8]],
                                [[1], [3], [5], [7], [9]]]))
assert np.array_equal(y,
                      np.array([[10], [11]]))

which you can use directory in model.fit_generator(data_gen,...) giving you the option to try out different sampling_rates, timesteps etc. You should probably investigate these parameters and how they affect the result in your thesis.

1
votes

Update with code that is roughly 5 times quicker than the last one:

x = np.load(nn_input + "/EOAN" + "/EOAN_X" + ".npy")
y = np.load(nn_input + "/EOAN" + "/EOAN_Y" + ".npy")
num_features = x.shape[1]
num_time_steps = 500

for train_index, test_index in tscv.split(x):
# Split into train and test set
print("Fold:", fold_counter, "\n" + "Train Index:", train_index, "Test Index:", test_index)
x_train_raw, y_train, x_test_raw, y_test = x[train_index], y[train_index], x[test_index], y[test_index]

# Scaling the data
scaler = StandardScaler()
scaler.fit(x_train_raw)
x_train_raw = scaler.transform(x_train_raw)
x_test_raw = scaler.transform(x_test_raw)

# Creating Input Data with variable timesteps
x_train = np.zeros((x_train_raw.shape[0] - num_time_steps + 1, num_time_steps, num_features), dtype="float32")
x_test = np.zeros((x_test_raw.shape[0] - num_time_steps + 1, num_time_steps, num_features), dtype="float32")

for row in range(len(x_train)):
    for timestep in range(num_time_steps):
        x_train[row][timestep] = x_train_raw[row + timestep]

for row in range(len(x_test)):
    for timestep in range(num_time_steps):
        x_test[row][timestep] = x_test_raw[row + timestep]

y_train = y_train[num_time_steps - 1:]
y_test = y_test[num_time_steps - 1:]