9
votes

If I understood correctly, to perform TBPTT in Keras we have to split our sequences into smaller parts of k timesteps. To reuse the state of our LSTM across all the parts of a sequence, we have to use the stateful parameter, according to the Keras documentation:

You can set RNN layers to be 'stateful', which means that the states computed for the samples in one batch will be reused as initial states for the samples in the next batch. This assumes a one-to-one mapping between samples in different successive batches.

So if I understand correctly, the 1st sample of the 1st batch is the 1st part of the 1st sequence, the 1st sample of the 2nd batch is the 2nd part of the 1st sequence, etc. I have 125973 sequences of length 1000 that I split into 40 parts of k=25 timesteps each. So my model should train on 40 batches, each containing all 125973 sequences for 25 timesteps. My issue is the memory of my GPU (a Quadro K2200, I'm poor): a batch size of 125973 is too much for it. I'd like to know if it is possible to keep the state of the LSTM inside the same batch and reset it between batches, so that I would instead have a batch size of 40 and 125973 batches.
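For reference, this is how I split the sequences into parts (a minimal NumPy sketch with a toy number of sequences; `sequences` stands in for my real data):

```python
import numpy as np

k = 25
seq_len = 1000
parts_per_seq = seq_len // k  # 40 parts per sequence

# toy stand-in: 6 sequences instead of the full 125973
sequences = np.arange(6 * seq_len, dtype=np.int32).reshape(6, seq_len)

# split each sequence into consecutive parts of k timesteps
parts = sequences.reshape(6, parts_per_seq, k)
print(parts.shape)  # (6, 40, 25)
```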

Here is my model:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dropout, LSTM, TimeDistributed, Dense

model = Sequential()
model.add(Embedding(len(char_to_num), 200, mask_zero=True, batch_input_shape=(batch_size, k)))
model.add(Dropout(0.5))
model.add(LSTM(512, activation='relu', return_sequences=True, stateful=True))
model.add(Dropout(0.5))
model.add(TimeDistributed(Dense(len(char_to_num), activation='softmax')))

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
model.summary()

EDIT 2021
Recent answers were posted this year, but this is kind of an old question. The state of libraries, DL, and NLP has changed a lot in the meantime, and I've moved on from LSTMs to Transformers. I haven't used an LSTM in years, and I neither plan to nor have the time to test the answers posted.

2
Did you get the answer? – Gonzalo Garcia

2 Answers

3
votes

Your batch size is flexible insofar as it must divide P = 125973. If no suitable divisor exists (for example, if P is prime), just add dummy sequences of 1000 zeros each until the total is divisible by the batch size. If you add dummy sequences, make sure to ignore them during training by passing an appropriate sample_weight ndarray to model.fit() (real sequences masked with 1, dummy sequences with 0) and calling model.compile(..., sample_weight_mode='temporal').
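The padding and the temporal sample weights could be set up like this (a minimal sketch of the masking idea; for simplicity, the weight array here covers one 25-step part per sequence, whereas the full setup would cover every snippet of every sequence):

```python
import numpy as np

batch_size = 50
P = 125973  # number of real sequences
k = 25      # timesteps per part

# pad up to the next multiple of batch_size with all-zero dummy sequences
num_dummies = (-P) % batch_size  # 27 in this case
N = P + num_dummies              # 126000, divisible by batch_size

# with sample_weight_mode='temporal', weights have shape (samples, timesteps)
sample_weight = np.ones((N, k), dtype=np.float32)
sample_weight[P:] = 0.0  # dummy sequences contribute nothing to the loss
print(N, num_dummies)  # 126000 27
```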

Then, for resetting states in between batches, go for keras callbacks:

import tensorflow as tf

# N must be divisible by batch_size
N = 40 * 126000  # number of time-series snippets (sequences + dummies)
batch_size = 50  # processing 50 sequences at a time

class StateResetter(tf.keras.callbacks.Callback):
    def on_train_batch_end(self, batch, logs=None):
        # reset states once all 40 parts of the current sequences are processed
        if (batch + 1) % 40 == 0:
            self.model.get_layer('my_lstm_layer').reset_states()

# input_data.shape = (N, 25, num_features)
# shuffle=False keeps successive parts of a sequence in successive batches
model.fit(input_data, labels, batch_size=batch_size, shuffle=False,
          callbacks=[StateResetter()], sample_weight=sample_weight)

I guess you should be able to figure out how to shape input_data accordingly.
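For instance, one way to arrange it (a NumPy sketch with toy sizes; the reset-every-40-batches logic above assumes exactly this ordering, where consecutive batches carry consecutive parts of the same block of sequences):

```python
import numpy as np

batch_size = 50
parts_per_seq = 40
k = 25
num_seqs = 200  # toy count; must be a multiple of batch_size

# parts[s, p] holds part p of sequence s (toy token ids here)
parts = np.arange(num_seqs * parts_per_seq * k).reshape(num_seqs, parts_per_seq, k)

# group the sequences into blocks of batch_size, then interleave so that
# batch b contains part (b % parts_per_seq) of one block of sequences
blocks = parts.reshape(num_seqs // batch_size, batch_size, parts_per_seq, k)
input_data = blocks.transpose(0, 2, 1, 3).reshape(-1, k)

# batch 0 is part 0 of sequences 0..49, batch 1 is part 1 of the same sequences
assert (input_data[:batch_size] == parts[:batch_size, 0]).all()
assert (input_data[batch_size:2 * batch_size] == parts[:batch_size, 1]).all()
```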

0
votes

I'd like to know if it is possible to keep the state of the LSTM inside the same batch and reset it between batches...

This is the approach to take in order to train the LSTM model better. Because the samples within a batch are adjacent to each other in time, the network trains well when run statefully over each batch. The memory savings from the smaller batch size are a desirable side effect.

Resetting the state after every batch could be implemented as shown by @Kirgsn.