
Goal

I am trying to run an LSTM autoencoder over a dataset of multivariate time series:
X_train (200, 23, 178) - X_val (100, 23, 178) - X_test (100, 23, 178)

Current situation

A plain autoencoder gets better results than a simple LSTM AE architecture.

I have some doubts about how I use the RepeatVector layer, which, as far as I understand, is supposed to simply repeat the last state of the LSTM/GRU cell a number of times equal to the sequence length, in order to match the input shape the decoder layer expects.

The model architecture does not raise any error, but the results are an order of magnitude worse than those of the simple AE, while I was expecting them to be at least comparable, since I am using an architecture that should better fit the temporal problem.

Are these results comparable, first of all?

Nevertheless, the reconstruction error of the LSTM-AE does not look good at all. [reconstruction error plot]

My AE model:

Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 178)               31862     
_________________________________________________________________
batch_normalization (BatchNo (None, 178)               712       
_________________________________________________________________
dense_1 (Dense)              (None, 59)                10561     
_________________________________________________________________
dense_2 (Dense)              (None, 178)               10680     
=================================================================
  • optimizer: sgd
  • loss: mse
  • activation function of the dense layers: relu
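
For reference, a minimal Keras sketch that would produce a summary like this (my reconstruction, not the exact code; it assumes the samples are fed as flat 178-dimensional vectors):

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization

n_features = 178

# plain dense autoencoder: 178 -> 59 -> 178, relu activations, mse loss
ae = Sequential([
    Dense(n_features, activation="relu", input_shape=(n_features,)),
    BatchNormalization(),
    Dense(59, activation="relu"),          # bottleneck
    Dense(n_features, activation="relu"),  # reconstruction
])
ae.compile(optimizer="sgd", loss="mse")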

My LSTM/GRU AE:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 23, 178)           0         
_________________________________________________________________
gru (GRU)                    (None, 59)                42126     
_________________________________________________________________
repeat_vector (RepeatVector) (None, 23, 59)            0         
_________________________________________________________________
gru_1 (GRU)                  (None, 23, 178)           127092    
_________________________________________________________________
time_distributed (TimeDistri (None, 23, 178)           31862     
=================================================================
  • optimizer: sgd
  • loss: mse
  • activation function of the gru layers: relu
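
For reference, a Keras sketch that roughly reproduces this summary (again my reconstruction, not the exact code; the relu activations and the final TimeDistributed(Dense(178)) follow the layer list above):

from tensorflow.keras import Model, Input
from tensorflow.keras.layers import GRU, RepeatVector, TimeDistributed, Dense

timesteps, n_features, latent_dim = 23, 178, 59

inputs = Input(shape=(timesteps, n_features))
# encoder: summarise the whole sequence into one latent vector
encoded = GRU(latent_dim, activation="relu")(inputs)
# repeat the latent vector once per timestep to feed the decoder
repeated = RepeatVector(timesteps)(encoded)
# decoder: emit one 178-unit vector per timestep
decoded = GRU(n_features, activation="relu", return_sequences=True)(repeated)
# per-timestep projection back to the feature dimension
outputs = TimeDistributed(Dense(n_features))(decoded)

lstm_ae = Model(inputs, outputs)
lstm_ae.compile(optimizer="sgd", loss="mse")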
Comments

Were you able to make progress on this? I'd be interested in how you were able to improve the reconstruction quality, if you succeeded. – Kappers
Just managed to improve both data quality and samples. Did not manage anything further through model complexity. – Guido
Interesting - what exactly did it require? For example, new data preprocessing, increasing training samples, etc. – Kappers
Sorry for the delay. Yes, I increased the training set with synthetic examples. – Guido

1 Answer


The two models you have above do not seem to be comparable in a meaningful way. The first model is attempting to compress your vectors of 178 values. It is quite possible that these vectors contain some redundant information, so it is reasonable to assume that you will be able to compress them.

The second model is attempting to compress a sequence of 23 x 178 vectors via a single GRU layer. This is a task with a significantly higher number of parameters. The RepeatVector simply takes the output of the first GRU layer (the encoder) and makes it the input of the second GRU layer (the decoder). But then you take a single value of the decoder. Instead of the TimeDistributed layer, I'd recommend that you use return_sequences=True in the second GRU (the decoder). Otherwise you are saying that you expect the 23 x 178 sequence to be made up of elements that all have the same value, which has to lead to a very high error / no solution.
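
As a sketch of that suggestion (one possible reading of it, using a Sequential model with the same dimensions as in the question; not tested on the data):

from tensorflow.keras import Sequential
from tensorflow.keras.layers import GRU, RepeatVector

timesteps, n_features, latent_dim = 23, 178, 59

seq_ae = Sequential([
    GRU(latent_dim, input_shape=(timesteps, n_features)),  # encoder
    RepeatVector(timesteps),                                # one copy per timestep
    # decoder returns the full sequence; with 178 units its output is already
    # (23, 178), so every timestep gets its own reconstruction
    GRU(n_features, return_sequences=True),
])
seq_ae.compile(optimizer="sgd", loss="mse")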

I'd recommend you take a step back. Is your goal to find similarity between the sequences, or to be able to make predictions? An autoencoder approach is preferable for a similarity task. In order to make predictions, I'd recommend that you move towards an approach where you apply a Dense(1) layer to the output of the sequence step.
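
One way to read that prediction setup (a hypothetical sketch, assuming a single regression target per sequence; the suggestion could equally mean a per-timestep Dense(1)):

from tensorflow.keras import Sequential
from tensorflow.keras.layers import GRU, Dense

timesteps, n_features = 23, 178

predictor = Sequential([
    GRU(59, input_shape=(timesteps, n_features)),  # sequence encoder, last state only
    Dense(1),                                      # regression head on that state
])
predictor.compile(optimizer="sgd", loss="mse")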

Is your dataset open / available? I'd be curious to take it for a spin, if that would be possible.