Forecasting stocks with LSTM in Keras (Python 3.7, Tensorflow 2.1.0)

Question

I'm trying to use an LSTM to predict how the Dow Jones Industrial Average will perform in the coming months. I think it is appropriate to frame this as a time series problem, since the DJIA behaves like a stock and my data values are evenly spaced in time. I'm a beginner, so I'm starting simply with only one feature (the daily close value). I know that stocks are very random and hard to predict well, and that the close value alone is not very informative, but I'll add other features later.

Dataset: DJIA historical data, Jan 28, 1985 - Jun 24, 2020, downloadable here: https://finance.yahoo.com/quote/%5EDJI/history?p=%5EDJI.

Visualization with matplotlib:

I use a series of close values (number = 'sequence_length') to predict the close value that immediately follows the series (the value at position sequence_length + 1). For example, I use days 0-29 to predict day 30, days 1-30 to predict day 31, etc. Put another way, I partition the data such that x_train[0] contains close values for days 0-29, and y_train[0] contains the single value for day 30. This is the result I get after running the model on my test data:

Ostensibly great, but I'm wondering if this whole concept is flawed: is the model merely seeing the data repetitively, and not learning any underlying pattern? See below for DJIA close predictions for 7/2020 through 4/2021. It seems to me that the prediction curve mimics the exact shape of the testing data, falling below 20,000 points and all...

Questions

1. Is this model valid? Is it a matter of changing parameters or reformatting data?
2. How the heck do you evaluate a model like this? Apparently 'accuracy' is an invalid metric. See below for the loss curve.
3. It was suggested that instead of using scalar close values for labels, I use sequences instead. For example, x_train[0] might include close values for days 0-29, and y_train[0] would include close values for days 30-60. I have been trying in vain to make this work and apparently have no idea how. I tried to make y_test and y_train Numpy arrays including arrays of sequence data - like this:

y_train, y_test = [], []
    
for i in range(sequence_length, len(training_set_scaled)):
    y_train.append(training_set_scaled[i + sequence_length: sequence_length*2 + i, 0])
    y_test.append(testing_set_scaled[i + sequence_length: sequence_length*2 + i, 0])
    
y_train = np.array(list(y_item for y_item in y_train))
y_test = np.array(list(y_item for y_item in y_test))

ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type numpy.ndarray).

Any help would be SO greatly appreciated, and perhaps we can all benefit ($). Joking... sort of.

The Code

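from math import ceil

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

df = pd.read_csv('DJIA_historical_data.csv') # 2D. Shape: (8924 examples, 7 features)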
close_data = df['Close'] # 1D (examples, )
dates = df['Date'] # 1D (examples, )
adj_dates = mdates.datestr2num(dates) # Convert Pandas series to np array so matplotlib can plot

# Important parameter
sequence_length: int = 90 # Aka 'timesteps', or number of close values used to make each new prediction

# Split off the training set and scale it. 
percent_training: float = 0.80
num_training_samples = ceil(percent_training*len(df)) # A whole number
training_set = df.iloc[:num_training_samples, 5:6].values # 2D, shape: (samples, 1 feature)
scaler = MinMaxScaler(feature_range = (0, 1))
training_set_scaled = scaler.fit_transform(training_set) #Shape is 2D: (num_training_samples, 1)

# Build 3D training set. Final shape: (examples, sequence_length, 1) 
x_train = np.array([training_set_scaled[i - sequence_length:i, 0] for i in range(sequence_length, len(training_set_scaled))]) 
x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1], 1))

# Build test sets
num_testing_samples: int = len(df) - x_train.shape[0] # Scalar value
testing_set = df.iloc[-num_testing_samples:, 5:6].values # 2D (examples, 1)
testing_set_scaled = scaler.transform(testing_set) # 2D ndarray (examples, 1). Reuse the scaler fitted on the training set instead of refitting it on the test data

x_test = np.array([testing_set_scaled[i - sequence_length:i, 0] for i in range(sequence_length, len(testing_set_scaled))])
x_test = np.reshape(x_test, (x_test.shape[0], x_test.shape[1], 1)) #3D shape: (examples-sequence_length, sequence_length, 1). 

# Build 1D label arrays, shape: (examples - sequence_length, )
y_train = np.array([training_set_scaled[i, 0] for i in range(sequence_length, len(training_set_scaled))])
y_test = np.array([testing_set_scaled[i, 0] for i in range(sequence_length, len(testing_set_scaled))]) # Already 1D: (examples - sequence_length, )
y_test = np.reshape(y_test, (y_test.shape[0])) # Still 1D; this reshape is a no-op

# Build Model
epochs: int = 150
batch_size: int = 32

LSTM_1 = LSTM(
    units = 5, # I reduced model complexity because I thought it would reduce overfitting. No such luck
    input_shape = (x_train.shape[1], 1),
    return_sequences = False,
    )

LSTM_2 = LSTM(
    units = 10
    )

model = Sequential()
model.add(LSTM_1) # Output shape: (batch_size, units), since return_sequences = False
model.add(Dropout(0.4))
# model.add(LSTM_2) # Output shape: ?
# model.add(Dropout(0.2))

model.add(Dense(1)) # Is linear activation appropriate here?
model.compile(loss = 'mean_squared_error', 
             optimizer = 'adam', 
             )

early_stopping = EarlyStopping(monitor='val_loss', 
                               mode='min', 
                               verbose = 1, 
                               patience = 9,
                               restore_best_weights = False
                               )

history = model.fit(x_train,
          y_train,
          epochs = epochs, 
          batch_size = batch_size,
          verbose = 2, 
          validation_split = 0.20,
          # validation_data = (x_test, y_test),
          callbacks = [early_stopping],
          )

# Evaluate performance 
model.summary()
loss = model.evaluate(x_test, y_test, batch_size = batch_size)

# early_stopping.stopped_epoch returns 0 if training didn't stop early. 
print('Training stopped after',early_stopping.stopped_epoch,'epochs.')

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss vs. Epoch')
plt.ylabel('loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left') # val_loss comes from validation_split, not the test set
plt.show()

prediction = model.predict(x_test)
prediction = scaler.inverse_transform(prediction)

y_test2 = np.reshape(y_test, (y_test.shape[0], 1))
y_test = scaler.inverse_transform(y_test2)

test_dates = adj_dates[-x_test.shape[0]:]

# Visualizing the results
plt.plot_date(test_dates, y_test, '-', linewidth = 2, color = 'red', label = 'Real DJIA Close')
plt.plot(test_dates, prediction, color = 'blue', label = 'Predicted Close')
plt.title('Close Prediction')
plt.xlabel('Time')
plt.ylabel('DJIA Close')
plt.legend()
plt.show()

# Generate future data 
time_horizon = sequence_length
# future_lookback = adj_dates[-time_horizon:]

last_n = x_test[-time_horizon:,:,:] # Find last n number of days
future_prediction = model.predict(last_n)
future_prediction2 = np.reshape(future_prediction, (future_prediction.shape[0], 1))
future_prediction3 = scaler.inverse_transform(future_prediction2)
future_prediction3 = np.reshape(future_prediction3, (future_prediction3.shape[0]))
 
full_dataset_numpy = np.array(close_data)
all_data = np.append(full_dataset_numpy, future_prediction3)
plt.plot(all_data, color = 'blue', label = 'All data')
plt.title('All data including predictions')
plt.xlabel('Time')
plt.ylabel('DJIA Close')
plt.legend()
plt.show()

# Generate dates for future predictions
# Begin at the last date in the dataset, then add 'time_horizon' many new dates
last_date = dates.iloc[-1] # String
timestamp_list = pd.date_range(last_date, periods = time_horizon).tolist() #List of timestamps

# Convert list of timestamps to list of strings 
datestring_list = [i.strftime("%Y-%m-%d") for i in timestamp_list] #List of strings

# Convert date strings to matplotlib date numbers (note: the first date is the last date already in the dataset)
datestring2 = mdates.datestr2num(datestring_list)

plt.plot_date(datestring2, future_prediction3, '-', color = 'blue', label = 'Predicted Close')
plt.title('DJIA Close Prediction')
plt.xlabel('Date')
plt.ylabel('Predicted Close')
plt.xticks(rotation = 45)
plt.legend()
plt.show()

Tensorflow Support · Accepted Answer · 2020-07-02T14:29:38

Case 1: At the start of your question, you mentioned "For example, I use days 0-29 to predict day 30, days 1-30 to predict day 31, etc."

Case 2: But in Question 3, you mentioned "For example, x_train[0] might include close values for days 0-29, and y_train[0] would include close values for days 30-60."

Do you want to predict the close value of the next day, or the close values of the next 30 days?

To generate the X and Y data (train and test), you can use the function below:

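import numpy as np

def univariate_data(dataset, start_index, end_index, history_size, target_size):
  # dataset: 1D NumPy array. Returns windows (n, history_size, 1) and labels (n,)
  data = []
  labels = []

  # The first usable index is the one right after a full history window
  start_index = start_index + history_size
  if end_index is None:
    end_index = len(dataset) - target_size

  for i in range(start_index, end_index):
    indices = range(i-history_size, i)
    # Reshape data from (history_size,) to (history_size, 1)
    data.append(np.reshape(dataset[indices], (history_size, 1)))
    # Single label: the value at index i + target_size (i is the first index after the window)
    labels.append(dataset[i+target_size])
  return np.array(data), np.array(labels)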

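The value of the argument history_size will be 30. Note that i is the first index after the history window and the label is dataset[i+target_size], so target_size = 0 gives the value immediately after the window (Case 1: days 0-29 predict day 30), while target_size = 1 would skip a day and predict day 31. The function returns a single label per window, so for Case 2 a larger target_size moves that one label further ahead rather than producing a 30-value sequence.

You need to call that function once for training and once for testing, as shown below:

univariate_past_history = 30

univariate_future_target = 0  # label is the value right after the window (Case 1)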

x_train_uni, y_train_uni = univariate_data(data, 0, TRAIN_SPLIT,
                                           univariate_past_history,
                                           univariate_future_target)
x_val_uni, y_val_uni = univariate_data(data, TRAIN_SPLIT, None,
                                       univariate_past_history,
                                       univariate_future_target)
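
For Case 2, univariate_data as written returns one scalar label per window (dataset[i+target_size]). A minimal sketch of a variant that returns a whole label sequence per window is below. The name make_windows and its horizon parameter are hypothetical and not part of the tutorial, and the toy array only stands in for the scaled close values. Every label slice has the same length, so np.array produces a regular float array rather than an object array of unequal slices, which is a plausible cause of the ValueError in the question.

```python
import numpy as np

def make_windows(series, history_size, horizon):
    """Hypothetical helper: pair each history window with the next `horizon` values.

    series: 1D array of (scaled) close values.
    Returns x of shape (n, history_size, 1) and y of shape (n, horizon).
    """
    series = np.asarray(series, dtype=np.float32)
    x, y = [], []
    # Stop early enough that every label slice has exactly `horizon` values.
    last_start = len(series) - history_size - horizon
    for start in range(last_start + 1):
        split = start + history_size
        x.append(series[start:split].reshape(history_size, 1))
        y.append(series[split:split + horizon])
    return np.array(x), np.array(y)

# Toy data standing in for the scaled close values.
toy = np.arange(100)
x, y = make_windows(toy, history_size=30, horizon=30)
print(x.shape, y.shape)  # (41, 30, 1) (41, 30)
```

Here x[0] holds days 0-29 and y[0] holds days 30-59. With labels of shape (n, horizon), the final Dense layer would need horizon units (for example Dense(30)) instead of Dense(1).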

Please see the TensorFlow tutorial on time series forecasting, which comprehensively explains both univariate (one column) and multivariate (multiple columns) time series analysis, with step-by-step code.

Answering your questions in the order you asked them:

1. Yes. Referring to the tutorial will help.
2. Yes, accuracy is an invalid metric for this regression problem. You can use MAE or MSE, as shown below:

simple_lstm_model.compile(optimizer='adam', loss='mae')

3. We should use NumPy arrays instead of sequences.
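
On question 2, an MAE or MSE value is easier to interpret next to a simple baseline. The sketch below is one common sanity check and is not taken from the tutorial: it compares a model's MAE with a naive persistence forecast that predicts each day's close as the previous day's close. The helper names mae and persistence_mae are hypothetical, and the arrays are toy values for illustration, not DJIA results.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between two equal-length 1D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def persistence_mae(y_true):
    """MAE of the naive forecast 'tomorrow equals today' (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    # Predict y[t] with y[t-1]; the first value has no predecessor and is skipped.
    return mae(y_true[1:], y_true[:-1])

# Toy values for illustration only.
actual = np.array([100.0, 102.0, 101.0, 105.0, 107.0])
predicted = np.array([99.0, 101.0, 103.0, 104.0, 108.0])

model_error = mae(actual[1:], predicted[1:])  # same days as the baseline
baseline_error = persistence_mae(actual)
print(model_error, baseline_error)  # 1.25 2.25
```

If the model's error is not clearly below the baseline's, the predicted curve is probably doing little more than echoing the most recent close.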

Please let me know if you face any other issues, and we will be happy to help you.
