15
votes

I am trying to reconstruct time series data with an LSTM autoencoder (Keras). For now I want to train the autoencoder on a small number of samples (5 samples, each 500 time steps long with 1 dimension). I want to make sure the model can reconstruct those 5 samples, and after that I will use all the data (6000 samples).

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, RepeatVector, TimeDistributed, Dense

window_size = 500
features = 1
data = data.reshape(5, window_size, features)

model = Sequential()

# Encoder
model.add(LSTM(256, input_shape=(window_size, features), return_sequences=True))
model.add(LSTM(128, return_sequences=False))
model.add(RepeatVector(window_size))

# Decoder
model.add(LSTM(128, return_sequences=True))
model.add(LSTM(256, return_sequences=True))
model.add(TimeDistributed(Dense(1)))

model.compile(optimizer='adam', loss='mse')
model.fit(data, data, epochs=100, verbose=1)

[Model summary image]

Training:

Epoch 1/100
5/5 [==============================] - 2s 384ms/step - loss: 0.1603
...
Epoch 100/100
5/5 [==============================] - 2s 388ms/step - loss: 0.0018

After training, I tried to reconstruct one of the 5 samples:

yhat = model.predict(np.expand_dims(data[1,:,:], axis=0), verbose=0)

[Reconstruction (blue) vs. input (orange)]
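(For reference, a comparison plot like the one above can be made with matplotlib roughly as follows; the plotting code is not part of the original question, so take it as a sketch:)

import matplotlib.pyplot as plt

# Overlay the reconstruction (shape (1, 500, 1)) and the original sample (shape (500, 1))
plt.plot(yhat[0, :, 0], label='reconstruction')
plt.plot(data[1, :, 0], label='input')
plt.legend()
plt.show()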

Why is the reconstruction so bad when the loss is small? How can I make the model better? Thanks.

2
Would you show all graphs from data[0,:,:] to data[4,:,:]? – Daniel Möller

2 Answers

5
votes

It seems to me that a time series should be given to the LSTMs in this format:

(samples, features, window_size)
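Concretely, for the 5 samples from the question this amounts to something like the following (my illustration, not part of the original answer):

# Original layout in the question: (samples, window_size, features) = (5, 500, 1)
data = data.reshape(5, 500, 1)

# Layout suggested here: treat the 500 values as features of a single time step,
# i.e. (5, 1, 500) -- window_size and features exchanged
data = data.reshape(5, 1, 500)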

So if you change the format (for example, I exchanged the variables) and look at the results:

[Reconstruction result after changing the format]

Code for reproducing the result (I didn't change the names of the variables, so please don't be confused :)):

import numpy as np
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense, RepeatVector, TimeDistributed
from keras.layers import LSTM

# Random-walk data: cumulative sum of uniform noise
N = 10000
data = np.random.uniform(-0.1, 0.1, size=(N, 500))
data = data.cumsum(axis=1)
print(data.shape)

# Exchanged variables: one time step with 500 features
window_size = 1
features = 500
data = data.reshape(N, window_size, features)

model = Sequential()

# Encoder
model.add(LSTM(32, input_shape=(window_size, features), return_sequences=True))
model.add(LSTM(16, return_sequences=False))
model.add(RepeatVector(window_size))

# Decoder
model.add(LSTM(16, return_sequences=True))
model.add(LSTM(32, return_sequences=True))
model.add(TimeDistributed(Dense(500)))

model.compile(optimizer='adam', loss='mse')
model.fit(data, data, epochs=100, verbose=1)

# Compare the reconstruction of one sample with the original
yhat = model.predict(np.expand_dims(data[1, :, :], axis=0), verbose=0)
plt.plot(np.arange(500), yhat[0, 0, :])
plt.plot(np.arange(500), data[1, 0, :])
plt.show()

Credit to sobe86: I used the data they proposed.

2
votes

I tried running your code on the following data:

data = np.random.uniform(-0.1, 0.1, size=(5, 500))
data = data.cumsum(axis=1)

so the data is just the cumulative sum of some random uniform noise. I ran for 1000 epochs, and my results are not as bad as yours; the LSTM seems to make some effort to follow the line, though it seems to just be hovering around the running mean (as one might expect).

[Reconstruction on a training sample]

Note that this is running the model on the TRAINING data (which you seem to imply you were doing in your question). If we try to look at performance on data that the model was not trained on, we can get bad results:

[Reconstruction on an unseen sample]

This is not surprising in the least: with such a small training set, we should fully expect the model to overfit and not generalise to new data.
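For example, one quick way to see the overfitting (a sketch under the same random-walk setup; not part of the original answer) is to generate a few fresh samples from the same process and compare their reconstructions:

import numpy as np
import matplotlib.pyplot as plt

# Fresh samples from the same process, never seen during training
unseen = np.random.uniform(-0.1, 0.1, size=(5, 500)).cumsum(axis=1)
unseen = unseen.reshape(5, 500, 1)

recon = model.predict(unseen, verbose=0)

# On training samples the reconstruction tracks the input; here it usually does not
plt.plot(unseen[0, :, 0], label='unseen input')
plt.plot(recon[0, :, 0], label='reconstruction')
plt.legend()
plt.show()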