3 votes

In TensorFlow/Keras, we can simply set return_sequences = False for the last LSTM layer before the classification/fully connected/activation (softmax/sigmoid) layer to get rid of the temporal dimension.
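
For example, a minimal Keras sketch (the layer sizes here just mirror my PyTorch model below, so this is only an illustration) would be:

from tensorflow import keras

model = keras.Sequential([
    keras.layers.Embedding(input_dim=65000, output_dim=64),
    # return_sequences=False (the default) drops the time axis:
    # the Bidirectional LSTM outputs (batch, 2 * units) instead of (batch, time, 2 * units)
    keras.layers.Bidirectional(keras.layers.LSTM(8, return_sequences=False)),
    keras.layers.Dense(5, activation="softmax"),
])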

In PyTorch, I can't find anything similar. For the classification task, I don't need a sequence-to-sequence model but a many-to-one architecture like this:

[figure: many-to-one RNN architecture diagram]

Here's my simple bi-LSTM model.

import torch
from torch import nn

class BiLSTMClassifier(nn.Module):
    def __init__(self):
        super(BiLSTMClassifier, self).__init__()
        self.embedding = torch.nn.Embedding(num_embeddings=65000, embedding_dim=64)
        self.bilstm = torch.nn.LSTM(input_size=64, hidden_size=8, num_layers=2,
                                    batch_first=True, dropout=0.2, bidirectional=True)
        # hidden_size (8) * num_directions (2) * seq_len (512) features, 5 classes
        self.linear = nn.Linear(8 * 2 * 512, 5)

    def forward(self, x):
        x = self.embedding(x)   # (batch, seq_len, embedding_dim)
        print(x.shape)
        x, _ = self.bilstm(x)   # (batch, seq_len, 2 * hidden_size)
        print(x.shape)
        x = self.linear(x.reshape(x.shape[0], -1))  # flatten all time steps
        print(x.shape)
        return x

# create our model

bilstmclassifier = BiLSTMClassifier()

Observing the shapes after each layer:

xx = torch.tensor(X_encoded[0]).reshape(1, 512)
print(xx.shape)
# torch.Size([1, 512])
bilstmclassifier(xx)
# torch.Size([1, 512, 64])   after the embedding
# torch.Size([1, 512, 16])   after the bi-LSTM
# torch.Size([1, 5])         after the linear layer

What can I do so that the last LSTM returns a tensor with shape (1, 16) instead of (1, 512, 16)?

Just take the last element of that dimension? x = x[:, -1, :] where x is the LSTM output. – xdurch0
Thanks, @xdurch0, that seems like a straightforward solution. Is it the same thing as TensorFlow's return_sequences = False? – Zabir Al Nazi
I decided to do a little digging and post a proper answer. tl;dr: Yes. – xdurch0

1 Answer

6 votes

The simplest way to do this is by indexing into the tensor:

x = x[:, -1, :]

where x is the RNN output. Of course, if batch_first is False, one would have to use x[-1, :, :] (or just x[-1]) to index into the time axis instead. It turns out this is the same thing TensorFlow/Keras does. The relevant code can be found in K.rnn here:

last_output = tuple(o[-1] for o in outputs)

Note that the code at this point uses the time_major data format, so the index is into the first (time) axis. Also, outputs is a tuple because it can hold multiple layers, state/cell pairs, etc., but it is generally the sequence of outputs for all time steps.

This is then used in the RNN class as follows:

if self.return_sequences:
    output = K.maybe_convert_to_ragged(is_ragged_input, outputs, row_lengths)
else:
    output = last_output

All in all, we can see that return_sequences=False just uses outputs[-1].
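
Applied to the model from the question, a minimal sketch of the change (reusing the same layer sizes; the linear layer now takes 2 * hidden_size = 16 features instead of the flattened sequence) could look like this:

from torch import nn

class BiLSTMClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings=65000, embedding_dim=64)
        self.bilstm = nn.LSTM(input_size=64, hidden_size=8, num_layers=2,
                              batch_first=True, dropout=0.2, bidirectional=True)
        # 2 * hidden_size = 16 features, independent of sequence length
        self.linear = nn.Linear(8 * 2, 5)

    def forward(self, x):
        x = self.embedding(x)   # (batch, seq_len, 64)
        x, _ = self.bilstm(x)   # (batch, seq_len, 16)
        x = x[:, -1, :]         # keep only the last time step -> (batch, 16)
        return self.linear(x)   # (batch, 5)

With this change, the output of bilstmclassifier(xx) has shape (1, 5) and no longer depends on the sequence length being exactly 512.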