RNN not training when batch size > 1 with variable length data

Question

I'm implementing a simple RNN that predicts a 0/1 label for variable-length time-series data. The network first feeds the input through an LSTM and then uses a linear layer for classification.

Normally we would train the network with mini-batches, but this simple network does not train when I use batch_size > 1.

I managed to create a minimal code sample that reproduces the problem. If you set batch_size=1 at line 95 (the ToyDataLoader call in main()), the network trains successfully, but if you set batch_size=2, the network does not train at all and the loss just bounces around. (Requires Python 3 and PyTorch >= 0.4.0.)

import numpy as np
import random
import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence


class ToyDataLoader(object):

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.index = 0
        self.dataset_size = 10

        # 10 hard-coded variable-length training samples,
        # each time step has 1 feature dimension
        self.X = [
            [[1], [1], [1], [1], [0], [0], [1], [1], [1]],
            [[1], [1], [1], [1]],
            [[0], [0], [1], [1]],
            [[1], [1], [1], [1], [1], [1], [1]],
            [[1], [1]],
            [[0]],
            [[0], [0], [0], [0], [0], [0], [0]],
            [[1]],
            [[0], [1]],
            [[1], [0]]
        ]

        # assign labels for the toy training set
        self.y = torch.LongTensor([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

    def __len__(self):
        return self.dataset_size // self.batch_size

    def __iter__(self):
        return self

    def __next__(self):
        if self.index + self.batch_size > self.dataset_size:
            self.index = 0
            raise StopIteration()
        if self.index == 0:  # shuffle the dataset
            tmp = list(zip(self.X, self.y))
            random.shuffle(tmp)
            self.X, self.y = zip(*tmp)
            self.y = torch.LongTensor(self.y)
        X = self.X[self.index: self.index + self.batch_size]
        y = self.y[self.index: self.index + self.batch_size]
        self.index += self.batch_size
        return X, y


class NaiveRNN(nn.Module):
    def __init__(self):
        super(NaiveRNN, self).__init__()
        self.lstm = nn.LSTM(1, 128)
        self.linear = nn.Linear(128, 2)

    def forward(self, X):
        '''
        Parameter:
            X: list containing variable length training data
        '''

        # get the length of each seq in the batch
        seq_lengths = [len(x) for x in X]

        # convert to torch.Tensor
        seq_tensor = [torch.Tensor(seq) for seq in X]

        # sort seq_lengths and seq_tensor based on seq_lengths, required by torch.nn.utils.rnn.pad_sequence
        pairs = sorted(zip(seq_lengths, seq_tensor),
                       key=lambda pair: pair[0], reverse=True)
        seq_lengths = torch.LongTensor([pair[0] for pair in pairs])
        seq_tensor = [pair[1] for pair in pairs]

        # padded_seq shape: (seq_len, batch_size, feature_size)
        padded_seq = pad_sequence(seq_tensor)

        # pack them up
        packed_seq = pack_padded_sequence(padded_seq, seq_lengths.numpy())

        # feed to rnn
        packed_output, (ht, ct) = self.lstm(packed_seq)

        # linear classification layer
        y_pred = self.linear(ht[-1])

        return y_pred


def main():
    trainloader = ToyDataLoader(batch_size=2)  # not training at all! !!
    # trainloader = ToyDataLoader(batch_size=1) # it converges !!!

    model = NaiveRNN()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adadelta(model.parameters(), lr=1.0)

    for epoch in range(30):
        # switch to train mode
        model.train()

        for i, (X, labels) in enumerate(trainloader):

            # compute output
            outputs = model(X)
            loss = criterion(outputs, labels)

            # measure accuracy and record loss
            _, predicted = torch.max(outputs, 1)
            accu = (predicted == labels).sum().item() / labels.shape[0]

            # compute gradient and do SGD step
            optimizer.zero_grad()
            loss.backward()

            optimizer.step()

            print('Epoch: [{}][{}/{}]\tLoss {:.4f}\tAccu {:.3f}'.format(
                epoch, i, len(trainloader), loss, accu))


if __name__ == '__main__':
    main()

Sample output when batch_size=1:

...
Epoch: [28][7/10]       Loss 0.1582     Accu 1.000
Epoch: [28][8/10]       Loss 0.2718     Accu 1.000
Epoch: [28][9/10]       Loss 0.0000     Accu 1.000
Epoch: [29][0/10]       Loss 0.2808     Accu 1.000
Epoch: [29][1/10]       Loss 0.0000     Accu 1.000
Epoch: [29][2/10]       Loss 0.0001     Accu 1.000
Epoch: [29][3/10]       Loss 0.0149     Accu 1.000
Epoch: [29][4/10]       Loss 0.1445     Accu 1.000
Epoch: [29][5/10]       Loss 0.2866     Accu 1.000
Epoch: [29][6/10]       Loss 0.0170     Accu 1.000
Epoch: [29][7/10]       Loss 0.0869     Accu 1.000
Epoch: [29][8/10]       Loss 0.0000     Accu 1.000
Epoch: [29][9/10]       Loss 0.0498     Accu 1.000

Sample output when batch_size=2:

...
Epoch: [27][2/5]        Loss 0.8051     Accu 0.000
Epoch: [27][3/5]        Loss 1.2835     Accu 0.000
Epoch: [27][4/5]        Loss 1.0782     Accu 0.000
Epoch: [28][0/5]        Loss 0.5201     Accu 1.000
Epoch: [28][1/5]        Loss 0.6587     Accu 0.500
Epoch: [28][2/5]        Loss 0.3488     Accu 1.000
Epoch: [28][3/5]        Loss 0.5413     Accu 0.500
Epoch: [28][4/5]        Loss 0.6769     Accu 0.500
Epoch: [29][0/5]        Loss 1.0434     Accu 0.000
Epoch: [29][1/5]        Loss 0.4460     Accu 1.000
Epoch: [29][2/5]        Loss 0.9879     Accu 0.000
Epoch: [29][3/5]        Loss 1.0784     Accu 0.500
Epoch: [29][4/5]        Loss 0.6051     Accu 1.000

I've searched through a lot of material and still can't figure out why.

Your loss looks horrible for both batch sizes. I don't think the problem has anything to do with the batch size. - Hadus
Your learning rate is (probably) too high! :) Try 0.001. - Hadus
@Hadus, do you mean the loss seems to bounce around even when batch_size=1? I think that's because I use variable-length sequences, so the network can hardly learn a general model from such random data. And I believe the learning rate is all right; the Adadelta optimizer needs a learning rate this large. I have tried learning rates such as 0.1, 0.001, and 0.0001, and the results remain the same. - Marvis Lu

arosa arosa · Accepted Answer · 2018-06-06T18:21:47

I think one major problem is that you are passing ht[-1] as input to the linear layer.
My understanding is that ht[-1] contains the state from the last time step, which would only be valid for the input with the maximum length.

To solve this, you need to unpack the output and, for each sequence, take the output at that sequence's own last time step.
Here is how the code needs to be changed:

# requires: from torch.nn.utils.rnn import pad_packed_sequence

# feed to rnn
packed_output, (ht, ct) = self.lstm(packed_seq)

# unpack output; lstm_out shape: (max_seq_len, batch_size, hidden_size)
lstm_out, seq_len = pad_packed_sequence(packed_output)

# time index of the last valid step of each sequence
last_input = seq_len - 1

# batch indices 0 .. batch_size - 1
indices = torch.arange(seq_len.size(0), dtype=torch.long)

# linear classification layer on each sequence's last valid output
y_pred = self.linear(lstm_out[last_input, indices, :])
return y_pred
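
The indexing above can be sanity-checked in isolation. The following is a minimal sketch, not part of the original answer: the toy batch and the hidden size of 8 are made up for illustration. It gathers each sequence's last valid output and compares it with ht[-1]. On the PyTorch versions I am aware of, a single-layer, unidirectional LSTM fed a PackedSequence is expected to report a match, in which case ht[-1] would already hold each sequence's last valid state and the gather would be an equivalent route rather than a change in behavior. Treat that as something to verify on your own install.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import (pad_sequence, pack_padded_sequence,
                                pad_packed_sequence)

torch.manual_seed(0)

# Hypothetical toy batch, already sorted by decreasing length
seqs = [torch.tensor([[1.], [1.], [0.], [1.]]),
        torch.tensor([[0.], [1.]]),
        torch.tensor([[1.]])]
lengths = torch.tensor([len(s) for s in seqs])

# Single-layer, unidirectional LSTM as in the question (hidden size is arbitrary)
lstm = nn.LSTM(1, 8)
packed = pack_padded_sequence(pad_sequence(seqs), lengths)
packed_output, (ht, ct) = lstm(packed)

# Gather the output at each sequence's own last valid time step
lstm_out, seq_len = pad_packed_sequence(packed_output)
batch_idx = torch.arange(seq_len.size(0), dtype=torch.long)
last_outputs = lstm_out[seq_len - 1, batch_idx, :]

# Compare with the final hidden state returned by the LSTM
same = torch.allclose(last_outputs, ht[-1], atol=1e-6)
print("gathered last outputs match ht[-1]:", same)
```

If the two do match, the cause of the non-convergence probably lies elsewhere in the training loop.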

I still wasn't able to make it converge with the remaining parameters, but this should help.
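
One further possibility, which is an assumption on my part and is not raised anywhere in this thread: forward() sorts the sequences by length, while the labels stay in the order the loader produced them. With batch_size=1 the sort is a no-op, but with batch_size > 1 the rows of y_pred may no longer line up with labels. A minimal sketch of undoing the sort before the linear layer could look like this. OrderPreservingRNN, its hidden_size parameter, and the toy batch are hypothetical names and values for illustration only.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence


class OrderPreservingRNN(nn.Module):
    """Hypothetical variant of NaiveRNN that returns predictions in input order."""

    def __init__(self, hidden_size=128):
        super().__init__()
        self.lstm = nn.LSTM(1, hidden_size)
        self.linear = nn.Linear(hidden_size, 2)

    def forward(self, X):
        seq_tensor = [torch.tensor(seq, dtype=torch.float) for seq in X]
        lengths = torch.tensor([len(s) for s in seq_tensor])

        # Sort by decreasing length and remember the permutation used
        sorted_lengths, perm = lengths.sort(descending=True)
        sorted_seqs = [seq_tensor[i] for i in perm.tolist()]

        packed = pack_padded_sequence(pad_sequence(sorted_seqs), sorted_lengths)
        _, (ht, _) = self.lstm(packed)

        # Invert the permutation so that row i corresponds to X[i] again
        _, inverse_perm = perm.sort()
        return self.linear(ht[-1][inverse_perm])


torch.manual_seed(0)
model = OrderPreservingRNN(hidden_size=8)

# Deliberately not sorted by length
X = [[[0], [1]], [[1], [1], [1], [1]], [[0]]]

batched = model(X)                            # one batch of three
single = torch.cat([model([x]) for x in X])   # three batches of one
print("batched rows match per-sample rows:",
      torch.allclose(batched, single, atol=1e-4))
```

The check at the end compares a batched forward pass with three single-sample passes; if the rows agree, the predictions are back in the same order as the labels. More recent PyTorch releases also accept enforce_sorted=False in pack_padded_sequence, which handles the sorting and unsorting internally, but that option is not available in 0.4.0.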
