Bad validation results using LSTM Keras

Question

I am trying to predict the stock market movement (1 = positive, 0 = negative) on day T from the previous time_steps days. I have tried time_steps = 20, 50, 100, and 300, and the results are similar in every case.

I have the following dataframe:

                Open       High        Low      Close      Volume  sentiment  Movement
Date
2009-01-02  51.349998  54.529999  51.070000  54.360001   7296400.0   0.084348       1.0
2009-01-05  55.730000  55.740002  53.029999  54.060001   9509800.0   0.104813       0.0
2009-01-06  54.549999  58.220001  53.750000  57.360001  11080100.0   0.185938       1.0
2009-01-07  56.290001  56.950001  55.349998  56.200001   7942700.0   0.047494       0.0
2009-01-08  54.990002  57.320000  54.580002  57.160000   6577900.0  -0.027938       1.0

The following dataframe is the same data after normalizing with MinMaxScaler(feature_range=(0, 1)).

                Open      High       Low     Close    Volume  sentiment  Movement
Date
2009-01-02  0.001402  0.002215  0.001750  0.002973  0.110116   0.591978       1.0
2009-01-05  0.003604  0.002819  0.002748  0.002823  0.148730   0.625025       0.0
2009-01-06  0.003011  0.004059  0.003114  0.004480  0.176124   0.756025       1.0
2009-01-07  0.003885  0.003424  0.003928  0.003897  0.121391   0.532468       0.0
2009-01-08  0.003232  0.003609  0.003536  0.004380  0.097581   0.410660       1.0

Train: 2263 samples
Test: 252 samples

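import numpy as np

TIME_STEPS = 300

def create_dataset(X, y, time_steps=1):
    # Build overlapping windows: rows i .. i+time_steps-1 are the input,
    # and the label is taken from row i+time_steps.
    Xs, ys = [], []
    for i in range(len(X) - time_steps):
        v = X.iloc[i:(i + time_steps)].values
        Xs.append(v)
        ys.append(y.iloc[i + time_steps])
    return np.array(Xs), np.array(ys)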


X_train, y_train = create_dataset(train, train.Movement, TIME_STEPS)
X_test, y_test = create_dataset(test, test.Movement, TIME_STEPS)

I have created a small LSTM model using Keras, shown below:

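from keras import optimizers
from keras.callbacks import EarlyStopping
from keras.layers import LSTM, Dense
from keras.models import Sequential

model = Sequential()
model.add(LSTM(50, input_shape=(X_train.shape[1], X_train.shape[2])))
model.add(Dense(1, activation='sigmoid'))

optimizer = optimizers.RMSprop()  # note: defined but never used; compile() below uses 'adam'
monitor = EarlyStopping(monitor='val_loss', min_delta=1e-2, patience=25)  # note: never passed to fit(), so it has no effect
model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
# batch_size and epochs are set elsewhere (values not shown)
history = model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.1, verbose=1, shuffle=False)

model.summary()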

The results seem to show overfitting to the training set. I have already tried adding dropout, adding more layers, and increasing or decreasing the number of neurons. As the number of epochs increases, the training accuracy reaches 90% without any problem, but the validation accuracy stays the same (as does the prediction accuracy).

[Figure: Loss (MSE)]

[Figure: Accuracy]

I cannot understand what the problem is.

theletz · Accepted Answer · 2020-01-11T19:05:22

Overfitting is usually caused by one of the following problems:

a small number of samples
the dimensionality of the problem is high (you can think of this as a large number of parameters)

What can you do in order to deal with this problem?

small number of samples:

get more data
use data augmentation (this is more relevant in computer vision)

the dimension of the problem is high:

use a less complex model (with fewer parameters)
use dropout
make the task harder for the model, for example by adding noise to the data

Those are the main approaches.
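
As a rough illustration of the "add noise to the data" idea, here is a minimal NumPy sketch. The helper name add_gaussian_noise, the noise_std value, and the toy array are assumptions made for this example, not something from the question; the noise level would need tuning against features scaled to [0, 1].

```python
import numpy as np

def add_gaussian_noise(X, noise_std=0.01, copies=1, seed=0):
    """Return X stacked with `copies` noisy duplicates (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Each duplicate is the original windows plus zero-mean Gaussian jitter.
    noisy = [X + rng.normal(0.0, noise_std, size=X.shape) for _ in range(copies)]
    return np.concatenate([X] + noisy, axis=0)

# Toy stand-in for X_train with shape (samples, time_steps, features).
X_demo = np.random.default_rng(1).random((8, 20, 6))
y_demo = np.arange(8) % 2

X_aug = add_gaussian_noise(X_demo, noise_std=0.01, copies=2)
y_aug = np.tile(y_demo, 3)  # labels repeat once per copy, in the same order
print(X_aug.shape, y_aug.shape)
```

Noise like this should only be applied to the training portion. Keras takes validation_split from the last samples of the arrays it is given, so with augmented arrays it is probably safer to hold out the validation windows first and pass them through validation_data instead.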

In your case, you are using an LSTM, which probably requires a lot of data, and you have a small dataset with low diversity (the samples are similar to each other because each one looks 300 time steps back).
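
To make the "low diversity" point concrete, the following sketch counts how many windows the create_dataset slicing produces and how much two neighbouring windows overlap. It assumes the 2263 and 252 figures in the question are row counts before windowing; window_stats is a hypothetical helper written for this illustration.

```python
def window_stats(n_rows, time_steps):
    """Count sliding windows and the overlap between two adjacent ones."""
    # range(len(X) - time_steps) is empty when the difference is not positive.
    n_windows = max(n_rows - time_steps, 0)
    # Consecutive windows start one row apart, so they share time_steps - 1 rows.
    shared_rows = time_steps - 1 if n_windows > 1 else 0
    overlap = shared_rows / time_steps if time_steps else 0.0
    return n_windows, shared_rows, overlap

train_rows, test_rows = 2263, 252  # assumed to be row counts before windowing
for steps in (20, 50, 100, 300):
    n_train, shared, frac = window_stats(train_rows, steps)
    n_test, _, _ = window_stats(test_rows, steps)
    print(f"time_steps={steps}: train windows={n_train}, "
          f"test windows={n_test}, adjacent overlap={frac:.1%}")
```

Under that assumption, 300 time steps leave 1963 training windows, each sharing 299 of its 300 rows with its neighbour, and a 252-row test set is too short to yield any window at all.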

I would start with a simpler model (a classic machine learning classifier) and add rolling features (using pandas rolling: mean, std, etc.) with different window sizes.
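
A minimal sketch of what such rolling features could look like with pandas is below. The window sizes, the choice of daily returns on Close, the one-row shift, and the synthetic prices are all assumptions for illustration; add_rolling_features is a hypothetical helper, and the Movement label here is simply "close higher than the previous close", which may not match how the question defines it.

```python
import numpy as np
import pandas as pd

def add_rolling_features(df, column="Close", windows=(5, 20, 60)):
    """Build rolling mean/std features of daily returns (hypothetical helper)."""
    out = pd.DataFrame(index=df.index)
    returns = df[column].pct_change()
    for w in windows:
        out[f"ret_mean_{w}"] = returns.rolling(w).mean()
        out[f"ret_std_{w}"] = returns.rolling(w).std()
    # Shift by one row so the features for day T only use data up to day T-1.
    out = out.shift(1)
    out["Movement"] = df["Movement"]
    return out.dropna()

# Synthetic stand-in for the question's dataframe (not real market data).
rng = np.random.default_rng(0)
dates = pd.bdate_range("2009-01-02", periods=200)
close = pd.Series(50 * np.exp(np.cumsum(rng.normal(0, 0.01, 200))), index=dates)
df = pd.DataFrame({"Close": close})
df["Movement"] = (df["Close"].diff() > 0).astype(float)

features = add_rolling_features(df)

# Keep the split chronological so no future rows leak into training.
split = int(len(features) * 0.9)
train_part, test_part = features.iloc[:split], features.iloc[split:]
print(features.shape, len(train_part), len(test_part))
```

Each row of the resulting table is one sample with a handful of columns, which a classic classifier (logistic regression, random forest, gradient boosting) can consume directly, in place of a 300-step sequence.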
