I'm currently working through a Keras tutorial on recurrent network training and I'm having trouble understanding the stateful LSTM concept. To keep things as simple as possible, all the sequences have the same length `seq_length`. As far as I understand it, the input data is of shape `(n_samples, seq_length, n_features)` and we then train our LSTM on `n_samples/M` batches of size `M`, as follows:
For each batch:

- Feed in the 2D tensors of shape `(seq_length, n_features)` and compute the gradient for each of them
- Sum these gradients to get the total gradient over the batch
- Backpropagate the gradient and update the weights
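In code, I picture this training procedure roughly like the snippet below (the layer size, loss, and dummy data are placeholders I chose for illustration, not the tutorial's actual values):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, LSTM

# Placeholder sizes, just for illustration
n_samples, seq_length, n_features = 1000, 100, 47
M = 25  # batch size

X = np.random.rand(n_samples, seq_length, n_features)  # dummy input sequences
y = np.random.rand(n_samples, n_features)               # dummy targets

model = Sequential()
model.add(LSTM(128, input_shape=(seq_length, n_features)))
model.add(Dense(n_features, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

# One gradient update per batch of M sequences
model.fit(X, y, batch_size=M, epochs=1)
```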
In the tutorial's example, feeding in the 2D tensors means feeding in sequences of `seq_length` letters, each letter encoded as a vector of length `n_features`. However, the tutorial says that in the Keras implementation of LSTMs, the hidden state is not reset after each individual sequence (2D tensor) is fed in, but only after a whole batch of sequences, in order to use more context.
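If I read the documentation correctly, the stateful variant would be set up roughly like this (my own sketch, reusing `X`, `y` and the placeholder sizes from the snippet above, not the tutorial's code):

```python
from keras.models import Sequential
from keras.layers import Dense, LSTM

# Stateful LSTMs need the batch size fixed up front via batch_input_shape
stateful_model = Sequential()
stateful_model.add(LSTM(128,
                        batch_input_shape=(M, seq_length, n_features),
                        stateful=True))
stateful_model.add(Dense(n_features, activation='softmax'))
stateful_model.compile(loss='categorical_crossentropy', optimizer='adam')

# As I understand it, with stateful=True the hidden state is carried over
# from batch to batch until reset_states() is called explicitly
for epoch in range(10):
    stateful_model.fit(X, y, batch_size=M, epochs=1, shuffle=False)
    stateful_model.reset_states()
```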
Why does keeping the hidden state of the previous sequence and using it as the initial hidden state of the current sequence improve learning and the predictions on our test set, given that this "previously learned" initial hidden state won't be available when making predictions? Moreover, Keras' default behaviour is to shuffle the input samples at the beginning of each epoch, so the batch context changes at each epoch. This behaviour seems contradictory to keeping the hidden state through a batch, since the batch context then becomes random.
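For reference, this is the shuffling behaviour I mean (again using the placeholder `model`, `X`, `y` and `M` from the first snippet):

```python
# shuffle=True is the default, so the composition of each batch changes
# from one epoch to the next...
model.fit(X, y, batch_size=M, epochs=10)

# ...whereas keeping a meaningful "batch context" would, I assume, require
# feeding the samples in a fixed order:
model.fit(X, y, batch_size=M, epochs=10, shuffle=False)
```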