I am trying to predict the stock market movement (1 = positive, 0 = negative) on day T using the preceding time_steps samples. I have tried time_steps = 20, 50, 100, and 300, and the results are similar in every case.
I have the following dataframe:
Open High Low Close Volume sentiment Movement
Date
2009-01-02 51.349998 54.529999 51.070000 54.360001 7296400.0 0.084348 1.0
2009-01-05 55.730000 55.740002 53.029999 54.060001 9509800.0 0.104813 0.0
2009-01-06 54.549999 58.220001 53.750000 57.360001 11080100.0 0.185938 1.0
2009-01-07 56.290001 56.950001 55.349998 56.200001 7942700.0 0.047494 0.0
2009-01-08 54.990002 57.320000 54.580002 57.160000 6577900.0 -0.027938 1.0
The dataframe below is the same data after normalizing with MinMaxScaler(feature_range=(0, 1)):
Open High Low Close Volume sentiment Movement
Date
2009-01-02 0.001402 0.002215 0.001750 0.002973 0.110116 0.591978 1.0
2009-01-05 0.003604 0.002819 0.002748 0.002823 0.148730 0.625025 0.0
2009-01-06 0.003011 0.004059 0.003114 0.004480 0.176124 0.756025 1.0
2009-01-07 0.003885 0.003424 0.003928 0.003897 0.121391 0.532468 0.0
2009-01-08 0.003232 0.003609 0.003536 0.004380 0.097581 0.410660 1.0
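The post does not say whether the scaler was fitted on the whole dataframe or on the training rows only. A minimal sketch of the train-only variant, assuming plain NumPy arrays and made-up toy values; fit_min_max and apply_min_max are hypothetical helper names that mimic the (0, 1) min-max formula rather than calling MinMaxScaler itself:

```python
import numpy as np

def fit_min_max(train):
    """Return per-column min and range computed from the training rows only."""
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid division by zero on constant columns
    return col_min, col_range

def apply_min_max(data, col_min, col_range):
    """Scale data with statistics that were learned elsewhere (here: on train)."""
    return (data - col_min) / col_range

# Toy values loosely shaped like the Close and Volume columns (made up).
train = np.array([[54.0, 7.0e6], [57.0, 1.1e7], [56.0, 8.0e6]])
test = np.array([[60.0, 9.0e6]])

col_min, col_range = fit_min_max(train)
train_scaled = apply_min_max(train, col_min, col_range)
test_scaled = apply_min_max(test, col_min, col_range)

print(train_scaled.min(axis=0), train_scaled.max(axis=0))
print(test_scaled)
```

With this variant the training columns span exactly [0, 1], while test values may fall outside that range because the test rows played no part in fitting.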
Train: 2263 samples
Test: 252 samples
TIME_STEPS = 300
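import numpy as np

def create_dataset(X, y, time_steps=1):
    # Build sliding windows: rows i .. i+time_steps-1 are the input,
    # and the label is the value of y at row i+time_steps.
    Xs, ys = [], []
    for i in range(len(X) - time_steps):
        v = X.iloc[i:(i + time_steps)].values
        Xs.append(v)
        ys.append(y.iloc[i + time_steps])
    return np.array(Xs), np.array(ys)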
X_train, y_train = create_dataset(train, train.Movement, TIME_STEPS)
X_test, y_test = create_dataset(test, test.Movement, TIME_STEPS)
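A quick check of how many windows create_dataset yields may be useful here. The sketch below assumes the Train/Test figures above are raw row counts before windowing; window_count and create_windows are hypothetical helpers that mirror the same loop on NumPy arrays, and the toy data is random:

```python
import numpy as np

def window_count(n_rows, time_steps):
    """Number of (window, label) pairs the sliding-window loop produces."""
    return max(n_rows - time_steps, 0)

def create_windows(X, y, time_steps):
    # Same loop as create_dataset, but on plain NumPy arrays.
    Xs, ys = [], []
    for i in range(len(X) - time_steps):
        Xs.append(X[i:i + time_steps])
        ys.append(y[i + time_steps])
    return np.array(Xs), np.array(ys)

rng = np.random.default_rng(0)
toy_X = rng.random((10, 7))                  # 10 rows, 7 columns (random toy data)
toy_y = (toy_X[:, -1] > 0.5).astype(float)   # toy binary labels

Xw, yw = create_windows(toy_X, toy_y, time_steps=3)
print(Xw.shape, yw.shape)       # (7, 3, 7) (7,)
print(window_count(2263, 300))  # 1963
print(window_count(252, 300))   # 0
```

Under that assumption, 252 test rows with TIME_STEPS = 300 leave no test windows at all, since range(252 - 300) is empty.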
I have created a small LSTM model using Keras, shown below:
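from keras import optimizers
from keras.callbacks import EarlyStopping
from keras.layers import LSTM, Dense
from keras.models import Sequential

model = Sequential()
model.add(LSTM(50, input_shape=(X_train.shape[1], X_train.shape[2])))
model.add(Dense(1, activation='sigmoid'))
optimizer = optimizers.RMSprop()  # defined but unused: compile() below uses 'adam'
monitor = EarlyStopping(monitor='val_loss', min_delta=1e-2, patience=25)  # not passed to fit()
model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
# batch_size and epochs are set elsewhere (values not shown)
history = model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.1, verbose=1, shuffle=False)
model.summary()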
The results seem to show overfitting to the training set. I have already tried adding dropout, adding more layers, and increasing or decreasing the number of neurons. As the number of epochs increases, the training accuracy reaches 90% without any problem, but the validation accuracy stays the same (and the same goes for the predictions).
I cannot understand what the problem is.
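One reference point for the flat validation accuracy is the accuracy of always predicting the most frequent class. A minimal sketch, assuming made-up labels (y_val_example is not taken from the data above) and a hypothetical helper name:

```python
import numpy as np

def majority_baseline_accuracy(y):
    """Accuracy obtained by always predicting the most frequent class in y."""
    y = np.asarray(y).astype(int)
    counts = np.bincount(y, minlength=2)  # counts[0] = zeros, counts[1] = ones
    return counts.max() / len(y)

# Made-up binary labels standing in for the validation slice of Movement.
y_val_example = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])
print(majority_baseline_accuracy(y_val_example))  # 0.6
```

If the validation accuracy sits at roughly this baseline, the model may simply be predicting one class for every window.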