I'm trying to use an LSTM to predict how the Dow Jones Industrial Average (DJIA) will perform in the coming months. I think it's appropriate to frame this as a time series problem, since the DJIA behaves like a stock and my data values are evenly spaced in time. I'm a beginner, so I'm starting simply with a single feature (the daily close value). I know that stocks are very random and hard to predict well, and that the close value alone is not very informative, but I'll add other features later.
Dataset: DJIA historical data, Jan 28, 1985 - Jun 24, 2020, downloadable here: https://finance.yahoo.com/quote/%5EDJI/history?p=%5EDJI.
Visualization with matplotlib:
I use a series of close values (number = 'sequence_length') to predict the close value that immediately follows the series (sequence_length + 1). For example, I use days 0-29 to predict day 30, days 1-30 to predict day 31, etc. Put another way, I partition the data such that x_train[0] contains close values for days 0-29, and y_train[0] contains the single value for day 30. Ok. So this is the result I get after running the model on my test data:
Ostensibly great, but I'm wondering if this whole concept is flawed: is the model merely seeing the data repetitively, and not learning any underlying pattern? See below for DJIA close predictions for 7/2020 through 4/2021. It seems to me that the prediction curve mimics the exact shape of the testing data, falling below 20,000 points and all...
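One quick sanity check for the worry above, offered as a sketch rather than a verdict on this model: compare the model's test error against a naive "persistence" forecast that simply repeats the previous day's close. If the LSTM's error is not clearly lower, the prediction curve is probably just tracking yesterday's value with a one-day lag. The helper names `rmse` and `persistence_rmse` and the toy series are made up for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two arrays of the same shape."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def persistence_rmse(series):
    """RMSE of the naive forecast 'tomorrow's close equals today's close'."""
    series = np.asarray(series, dtype=float)
    return rmse(series[1:], series[:-1])

# Toy data standing in for a close series (not real DJIA values).
toy_close = np.array([100.0, 102.0, 101.0, 105.0, 107.0])
baseline = persistence_rmse(toy_close)
print(f"Persistence RMSE: {baseline:.3f}")
```

With the arrays from the code in this post, the comparison would be `rmse(y_test, prediction)` against `persistence_rmse(y_test)`, both computed after `inverse_transform` so the numbers are in index points.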
Questions
- Is this model valid? Is it a matter of changing parameters or reformatting data?
- How the heck do you evaluate a model like this? Apparently 'accuracy' is an invalid metric. See below for loss curve
- It was suggested that instead of using scalar close values for labels, I use sequences instead. For example, x_train[0] might include close values for days 0-29, and y_train[0] would include close values for days 30-60. I have been trying in vain to make this work and apparently have no idea how. I tried to make y_test and y_train Numpy arrays including arrays of sequence data - like this:
y_train, y_test = [], []
for i in range(sequence_length, len(training_set_scaled)):
    y_train.append(training_set_scaled[i + sequence_length: sequence_length*2 + i, 0])
    y_test.append(testing_set_scaled[i + sequence_length: sequence_length*2 + i, 0])
y_train = np.array(list(y_item for y_item in y_train))
y_test = np.array(list(y_item for y_item in y_test))
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type numpy.ndarray).
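A likely cause of this error, stated as an assumption since the full traceback is not shown: near the end of the array the slice `[i + sequence_length : sequence_length*2 + i]` runs past the data and returns fewer than `sequence_length` values (eventually none), so the rows have different lengths and NumPy produces an object array that TensorFlow cannot convert. The loop also uses the training set's length to index the test set. A minimal sketch that keeps only windows fitting entirely inside the series follows; `make_windows`, `n_in` and `n_out` are hypothetical names.

```python
import numpy as np

def make_windows(series, n_in, n_out):
    """Split a 1D series into (inputs, labels) with fixed-length windows.

    x[k] holds values k .. k+n_in-1; y[k] holds the next n_out values.
    Only windows that fit entirely inside the series are kept, so every
    row has the same length and the result is a regular float array.
    """
    series = np.asarray(series, dtype=float)
    n_samples = len(series) - n_in - n_out + 1
    if n_samples <= 0:
        raise ValueError("series is too short for the requested window sizes")
    x = np.stack([series[k:k + n_in] for k in range(n_samples)])
    y = np.stack([series[k + n_in:k + n_in + n_out] for k in range(n_samples)])
    return x[..., np.newaxis], y  # x: (samples, n_in, 1), y: (samples, n_out)

toy = np.arange(10, dtype=float)  # stand-in for training_set_scaled[:, 0]
x_demo, y_demo = make_windows(toy, n_in=3, n_out=2)
print(x_demo.shape, y_demo.shape)
```

Calling it once on the training series and once on the test series avoids sharing a loop bound between the two. With labels shaped (samples, n_out), the final layer would need n_out units, e.g. `Dense(n_out)` instead of `Dense(1)`.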
Any help would be SO greatly appreciated, and perhaps we can all benefit ($). Joking... sort of.
The Code
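# Imports inferred from the names used below; tensorflow.keras is an assumption
from math import ceil
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

df = pd.read_csv('DJIA_historical_data.csv') # 2D. Shape: (8924 examples, 7 features)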
close_data = df['Close'] # 1D (examples, )
dates = df['Date'] # 1D (examples, )
adj_dates = mdates.datestr2num(dates) # Convert Pandas series to np array so matplotlib can plot
# Important parameter
sequence_length: int = 90 # Aka 'timesteps', or number of close values used to make each new prediction
# Split off the training set and scale it.
percent_training: float = 0.80
num_training_samples = ceil(percent_training*len(df)) # A whole number
training_set = df.iloc[:num_training_samples, 5:6].values # 2D, shape: (samples, 1 feature)
scaler = MinMaxScaler(feature_range = (0, 1))
training_set_scaled = scaler.fit_transform(training_set) #Shape is 2D: (num_training_samples, 1)
# Build 3D training set. Final shape: (examples, sequence_length, 1)
x_train = np.array([training_set_scaled[i - sequence_length:i, 0] for i in range(sequence_length, len(training_set_scaled))])
x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1], 1))
# Build test sets
num_testing_samples: int = len(df) - x_train.shape[0] # Scalar value
testing_set = df.iloc[-num_testing_samples:, 5:6].values # 2D (examples, 1)
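testing_set_scaled = scaler.transform(testing_set) # 2D ndarray (examples, 1). Reuses the scaler fitted on the training set; refitting on test data would scale the two sets differently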
x_test = np.array([testing_set_scaled[i - sequence_length:i, 0] for i in range(sequence_length, len(testing_set_scaled))])
x_test = np.reshape(x_test, (x_test.shape[0], x_test.shape[1], 1)) #3D shape: (examples-sequence_length, sequence_length, 1).
# Build 1D training labels (examples, )
y_train = np.array([training_set_scaled[i, 0] for i in range(sequence_length, len(training_set_scaled))])
y_test = np.array([testing_set_scaled[i, 0] for i in range(sequence_length, len(testing_set_scaled))]) # 1D (examples - sequence_length, )
y_test = np.reshape(y_test, (y_test.shape[0])) # Already 1D, so this reshape is a no-op
# Build Model
epochs: int = 150
batch_size: int = 32
LSTM_1 = LSTM(
    units = 5, # I reduced model complexity because I thought it would reduce overfitting. No such luck
    input_shape = (x_train.shape[1], 1),
    return_sequences = False,
)
LSTM_2 = LSTM(
    units = 10
)
model = Sequential()
model.add(LSTM_1) # Output shape: (batch_size, units), since return_sequences = False
model.add(Dropout(0.4))
# model.add(LSTM_2) # Would need return_sequences = True on LSTM_1. Output shape: (batch_size, 10)
# model.add(Dropout(0.2))
model.add(Dense(1)) # Is linear activation appropriate here?
model.compile(
    loss = 'mean_squared_error',
    optimizer = 'adam',
)
early_stopping = EarlyStopping(
    monitor = 'val_loss',
    mode = 'min',
    verbose = 1,
    patience = 9,
    restore_best_weights = False,
)
history = model.fit(
    x_train,
    y_train,
    epochs = epochs,
    batch_size = batch_size,
    verbose = 2,
    validation_split = 0.20,
    # validation_data = (x_test, y_test),
    callbacks = [early_stopping],
)
# Evaluate performance
model.summary()
loss = model.evaluate(x_test, y_test, batch_size = batch_size)
# early_stopping.stopped_epoch returns 0 if training didn't stop early.
print('Training stopped after',early_stopping.stopped_epoch,'epochs.')
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss vs. Epoch')
plt.ylabel('loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left') # val_loss comes from validation_split, not from the test set
plt.show()
prediction = model.predict(x_test)
prediction = scaler.inverse_transform(prediction)
y_test2 = np.reshape(y_test, (y_test.shape[0], 1))
y_test = scaler.inverse_transform(y_test2)
test_dates = adj_dates[-x_test.shape[0]:]
# Visualizing the results
plt.plot_date(test_dates, y_test, '-', linewidth = 2, color = 'red', label = 'Real DJIA Close')
plt.plot(test_dates, prediction, color = 'blue', label = 'Predicted Close')
plt.title('Close Prediction')
plt.xlabel('Time')
plt.ylabel('DJIA Close')
plt.legend()
plt.show()
# Generate future data
time_horizon = sequence_length
# future_lookback = adj_dates[-time_horizon:]
last_n = x_test[-time_horizon:,:,:] # Last n test windows, each of length sequence_length
future_prediction = model.predict(last_n)
future_prediction2 = np.reshape(future_prediction, (future_prediction.shape[0], 1))
future_prediction3 = scaler.inverse_transform(future_prediction2)
future_prediction3 = np.reshape(future_prediction3, (future_prediction3.shape[0]))
full_dataset_numpy = np.array(close_data)
all_data = np.append(full_dataset_numpy, future_prediction3)
plt.plot(all_data, color = 'blue', label = 'All data')
plt.title('All data including predictions')
plt.xlabel('Time')
plt.ylabel('DJIA Close')
plt.legend()
plt.show()
# Generate dates for future predictions
# Begin at the last date in the dataset, then add 'time_horizon' many new dates
last_date = dates.iloc[-1] # String
timestamp_list = pd.date_range(last_date, periods = time_horizon + 1).tolist()[1:] # List of timestamps; the first value is dropped because it is already in the dataset
# Convert list of timestamps to list of strings
datestring_list = [i.strftime("%Y-%m-%d") for i in timestamp_list] #List of strings
datestring2 = mdates.datestr2num(datestring_list)
plt.plot_date(datestring2, future_prediction3, '-', color = 'blue', label = 'Predicted Close')
plt.title('DJIA Close Prediction')
plt.xlabel('Date')
plt.ylabel('Predicted Close')
plt.xticks(rotation = 45)
plt.legend()
plt.show()
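A note on the "Generate future data" step: `model.predict(last_n)` returns one-step-ahead predictions for the last `time_horizon` windows of the test set, i.e. for days that are already in the data, which may be why the "future" curve has the same shape as the end of the test data. One common alternative is recursive forecasting: predict one step, append that prediction to the window, drop the oldest value, and repeat. A minimal sketch, where `recursive_forecast` and `predict_fn` are hypothetical names and a toy predictor stands in for the trained model:

```python
import numpy as np

def recursive_forecast(predict_fn, last_window, n_steps):
    """Roll a one-step model forward n_steps, feeding predictions back in.

    predict_fn: maps an array of shape (1, window, 1) to shape (1, 1).
    last_window: 1D sequence holding the most recent scaled values.
    """
    window = np.asarray(last_window, dtype=float).copy()
    forecasts = []
    for _ in range(n_steps):
        batch = window.reshape(1, -1, 1)
        next_value = float(np.asarray(predict_fn(batch)).ravel()[0])
        forecasts.append(next_value)
        # Slide the window: drop the oldest value, append the new prediction.
        window = np.append(window[1:], next_value)
    return np.array(forecasts)

# Toy predictor (mean of the window) standing in for the trained model.
def toy_predict(batch):
    return batch.mean(axis=1)

demo = recursive_forecast(toy_predict, last_window=[1.0, 2.0, 3.0], n_steps=2)
print(demo)
```

With the Keras model above, `predict_fn` could be `lambda w: model.predict(w, verbose=0)` and `last_window` could be `testing_set_scaled[-sequence_length:, 0]`, with the output passed through `scaler.inverse_transform` after reshaping to 2D. Errors compound with every step, so forecasts over long horizons should be read with caution.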