Difference between these implementations of LSTM Autoencoder?

Question

Specifically what spurred this question is the return_sequence argument of TensorFlow's version of an LSTM layer.

The docs say:

Boolean. Whether to return the last output. in the output sequence, or the full sequence. Default: False.

I've seen some implementations, especially autoencoders that use this argument to strip everything but the last element in the output sequence as the output of the 'encoder' half of the autoencoder.

Below are three different implementations. I'd like to understand the reasons behind the differences, as the seem like very large differences but all call themselves the same thing.

Example 1 (TensorFlow):

This implementation strips away all outputs of the LSTM except the last element of the sequence, and then repeats that element some number of times to reconstruct the sequence:

model = Sequential()
model.add(LSTM(100, activation='relu', input_shape=(n_in,1)))
# Decoder below
model.add(RepeatVector(n_out))
model.add(LSTM(100, activation='relu', return_sequences=True))
model.add(TimeDistributed(Dense(1)))

When looking at implementations of autoencoders in PyTorch, I don't see authors doing this. Instead they use the entire output of the LSTM for the encoder (sometimes followed by a dense layer and sometimes not).

Example 1 (PyTorch):

This implementation trains an embedding BEFORE an LSTM layer is applied... It seems to almost defeat the idea of an LSTM based auto-encoder... The sequence is already encoded by the time it hits the LSTM layer.

class EncoderLSTM(nn.Module):
  def __init__(self, input_size, hidden_size, n_layers=1, drop_prob=0):
    super(EncoderLSTM, self).__init__()
    self.hidden_size = hidden_size
    self.n_layers = n_layers

    self.embedding = nn.Embedding(input_size, hidden_size)
    self.lstm = nn.LSTM(hidden_size, hidden_size, n_layers, dropout=drop_prob, batch_first=True)

  def forward(self, inputs, hidden):
    # Embed input words
    embedded = self.embedding(inputs)
    # Pass the embedded word vectors into LSTM and return all outputs
    output, hidden = self.lstm(embedded, hidden)
    return output, hidden

Example 2 (PyTorch):

This example encoder first expands the input with one LSTM layer, then does its compression via a second LSTM layer with a smaller number of hidden nodes. Besides the expansion, this seems in line with this paper I found: https://arxiv.org/pdf/1607.00148.pdf

However, in this implementation's decoder, there is no final dense layer. The decoding happens through a second lstm layer that expands the encoding back to the same dimension as the original input. See it here. This is not in line with the paper (although I don't know if the paper is authoritative or not).

class Encoder(nn.Module):
  def __init__(self, seq_len, n_features, embedding_dim=64):
    super(Encoder, self).__init__()
    self.seq_len, self.n_features = seq_len, n_features
    self.embedding_dim, self.hidden_dim = embedding_dim, 2 * embedding_dim
    self.rnn1 = nn.LSTM(
      input_size=n_features,
      hidden_size=self.hidden_dim,
      num_layers=1,
      batch_first=True
    )
    self.rnn2 = nn.LSTM(
      input_size=self.hidden_dim,
      hidden_size=embedding_dim,
      num_layers=1,
      batch_first=True
    )
  def forward(self, x):
    x = x.reshape((1, self.seq_len, self.n_features))
    x, (_, _) = self.rnn1(x)
    x, (hidden_n, _) = self.rnn2(x)
    return hidden_n.reshape((self.n_features, self.embedding_dim))

Question:

I'm wondering about this discrepancy in implementations. The difference seems quite large. Are all of these valid ways to accomplish the same thing? Or are some of these mis-guided attempts at a "real" LSTM autoencoder?

Luca Angioloni Luca Angioloni · Accepted Answer · 2020-12-08T15:38:19

There is no official or correct way of designing the architecture of an LSTM based autoencoder... The only specifics the name provides is that the model should be an Autoencoder and that it should use an LSTM layer somewhere.

The implementations you found are each different and unique on their own even though they could be used for the same task.

Let's describe them:

TF implementation:
- It assumes the input has only one channel, meaning that each element in the sequence is just a number and that this is already preprocessed.
- The default behaviour of the LSTM layer in Keras/TF is to output only the last output of the LSTM, you could set it to output all the output steps with the return_sequences parameter.
- In this case the input data has been shrank to (batch_size, LSTM_units)
- Consider that the last output of an LSTM is of course a function of the previous outputs (specifically if it is a stateful LSTM)
- It applies a Dense(1) in the last layer in order to get the same shape as the input.
PyTorch 1:
- They apply an embedding to the input before it is fed to the LSTM.
- This is standard practice and it helps for example to transform each input element to a vector form (see word2vec for example where in a text sequence, each word that isn't a vector is mapped into a vector space). It is only a preprocessing step so that the data has a more meaningful form.
- This does not defeat the idea of the LSTM autoencoder, because the embedding is applied independently to each element of the input sequence, so it is not encoded when it enters the LSTM layer.
PyTorch 2:
- In this case the input shape is not (seq_len, 1) as in the first TF example, so the decoder doesn't need a dense after. The author used a number of units in the LSTM layer equal to the input shape.

In the end you choose the architecture of your model depending on the data you want to train on, specifically: the nature (text, audio, images), the input shape, the amount of data you have and so on...