I'm currently working through a Keras tutorial on recurrent network training and I'm having trouble understanding the stateful LSTM concept. To keep things as simple as possible, the sequences all have the same length seq_length. As far as I understand it, the input data has shape (n_samples, seq_length, n_features), and we then train our LSTM on n_samples/M batches of size M as follows:

For each batch:

  1. Feed in the 2D-tensors (seq_length, n_features) and for each input 2D-tensor compute the gradient
  2. Sum these gradients to get the total gradient on the batch
  3. Backpropagate the gradient and update weights

In the tutorial's example, feeding in a 2D tensor means feeding in a sequence of seq_length letters, each encoded as a vector of length n_features. However, the tutorial says that in the Keras implementation of LSTMs, the hidden state is not reset after a whole sequence (2D tensor) is fed in, but only after a batch of sequences, in order to use more context.
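For concreteness, here is a minimal sketch of those shapes (the numbers are made up for illustration and are not taken from the tutorial):

```python
import numpy as np

# Hypothetical dimensions, only to make the shapes concrete.
n_samples, seq_length, n_features = 1000, 10, 26    # e.g. one-hot encoded letters
M = 25                                               # batch size

X = np.zeros((n_samples, seq_length, n_features))    # full training input

# Training runs over n_samples / M = 40 batches. Each batch is a 3D tensor
# of shape (M, seq_length, n_features): M sequences of seq_length time steps,
# each time step being a feature vector of length n_features.
batch = X[:M]          # shape (M, seq_length, n_features)
sequence = batch[0]    # one 2D tensor of shape (seq_length, n_features)
```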

Why does keeping the hidden state of the previous sequence and using it as the initial hidden state for the current sequence improve learning and predictions on our test set, given that this "previously learned" initial hidden state won't be available when making predictions? Moreover, Keras' default behaviour is to shuffle input samples at the beginning of each epoch, so the batch context changes at each epoch. This seems to contradict keeping the hidden state through a batch, since the batch context is random.

1 Answer

LSTMs in Keras aren't stateful by default: each sequence starts with freshly reset states. By setting stateful=True on your recurrent layer, successive inputs in a batch no longer reset the network state. This assumes that the sequences really are successive chunks of one longer sequence, and it means that, in a (very informal) sense, you're training on sequences of length batch_size * seq_length.
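As a minimal sketch of what that looks like with the classic Sequential API (the layer size and dimensions are made up for illustration; stateful=True requires a fixed batch size, which is why batch_input_shape is used instead of input_shape):

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense

batch_size, seq_length, n_features = 25, 10, 26    # hypothetical values

model = Sequential()
# stateful=True: the LSTM keeps its state across successive inputs instead of
# resetting it for every sequence; it is only cleared when reset_states() is called.
model.add(LSTM(64,
               batch_input_shape=(batch_size, seq_length, n_features),
               stateful=True))
model.add(Dense(n_features, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
```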

Why does keeping the hidden state of the previous sequence and using it as the initial hidden state for the current sequence improve learning and predictions on our test set, given that this "previously learned" initial hidden state won't be available when making predictions?

In theory, it improves learning because a longer context can teach the network things about the distribution that are still relevant when testing on the individually shorter sequences. If the network is learning some probability distribution, that distribution should hold over different sequence lengths.

Moreover, Keras' default behaviour is to shuffle input samples at the beginning of each epoch, so the batch context changes at each epoch. This seems to contradict keeping the hidden state through a batch, since the batch context is random.

I haven't checked the implementation in detail, but in practice you pass shuffle=False to fit() when using stateful=True, so that sequences are not reordered within or across batches and the ordering the carried-over state relies on is preserved.
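Continuing the hypothetical model sketched above, a typical training loop for a stateful layer looks roughly like this (X_train and y_train are assumed to be ordered so that consecutive samples are consecutive chunks of the same long sequence, with a length that is a multiple of batch_size):

```python
for epoch in range(50):
    model.fit(X_train, y_train,
              batch_size=batch_size,
              epochs=1,
              shuffle=False)    # keep the ordering the carried-over state relies on
    model.reset_states()        # clear the accumulated state between epochs
```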

In general, when we give the network some initial state, we don't mean for that to be a universally better starting point. It just means that the network can take the information from previous sequences into account when training.