python - Stateful LSTM and stream predictions

Question

I've trained an LSTM model (built with Keras and TF) on multiple batches of 7 samples with 3 features each, shaped like the sample below (the numbers are just placeholders for the purpose of explanation). Each batch is labeled 0 or 1:

Data:

[
   [[1,2,3],[1,2,3],[1,2,3],[1,2,3],[1,2,3],[1,2,3],[1,2,3]]
   [[1,2,3],[1,2,3],[1,2,3],[1,2,3],[1,2,3],[1,2,3],[1,2,3]]
   [[1,2,3],[1,2,3],[1,2,3],[1,2,3],[1,2,3],[1,2,3],[1,2,3]]
   ...
]

i.e., batches of m sequences, each of length 7, whose elements are 3-dimensional vectors (so a batch has shape (m, 7, 3))

Target:

[
   [1]
   [0]
   [1]
   ...
]

In my production environment, the data is a stream of samples with 3 features ([1,2,3],[1,2,3]...). I would like to feed each sample to my model as it arrives and get the intermediate probability without waiting for the entire batch (7); see the animation below.

One of my thoughts was padding the batch with 0 for the missing samples, [[0,0,0],[0,0,0],[0,0,0],[0,0,0],[0,0,0],[0,0,0],[1,2,3]] but that seems to be inefficient.

I would appreciate any help pointing me in the right direction on both saving the LSTM intermediate state persistently while waiting for the next sample, and predicting on partial data with a model trained on a specific batch size.

Update, including model code:

opt = optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=10e-8, decay=0.001)
model = Sequential()

num_features = data.shape[2]
num_samples = data.shape[1]

first_lstm = LSTM(32, batch_input_shape=(None, num_samples, num_features), return_sequences=True, activation='tanh')
model.add(
    first_lstm)
model.add(LeakyReLU())
model.add(Dropout(0.2))
model.add(LSTM(16, return_sequences=True, activation='tanh'))
model.add(Dropout(0.2))
model.add(LeakyReLU())
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer=opt,
              metrics=['accuracy', keras_metrics.precision(), keras_metrics.recall(), f1])

Model Summary:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm_1 (LSTM)                (None, 100, 32)           6272      
_________________________________________________________________
leaky_re_lu_1 (LeakyReLU)    (None, 100, 32)           0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 100, 32)           0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 100, 16)           3136      
_________________________________________________________________
dropout_2 (Dropout)          (None, 100, 16)           0         
_________________________________________________________________
leaky_re_lu_2 (LeakyReLU)    (None, 100, 16)           0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 1600)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 1601      
=================================================================
Total params: 11,009
Trainable params: 11,009
Non-trainable params: 0
_________________________________________________________________

@ShlomiSchwartz I added a note at the beginning of my answer. Please read that to make sure you are on the right track. — today
@ShlomiSchwartz This might help to understand the difference. — today
@ShlomiSchwartz It does not matter whether your training model is stateful or not. You can use the solutions involving setting stateful=True, by copying the weights of training model to inference model and enable statefulness there as @DanielMöller 's answer does. — today
Hello @ShlomiSchwartz, I updated my answer considering your model architecture, please take a look. — Daniel Möller
If the batches are parts of a longer sequence, your model should be stateful, or you should be using batches containing "entire" sequences. Notice that in all cases, the models will NOT see any relation between sequences in the SAME batch. Stateful models connects one batch to another batch. I suggest a detailed look at my answer in the link to understand exactly how keras interprets your data: stackoverflow.com/questions/38714959/understanding-keras-lstms/… — Daniel Möller

Daniel Möller · Accepted Answer · 2018-11-14T15:30:12

I think there might be an easier solution.

If your model does not have convolutional layers or any other layers that act upon the length/steps dimension, you can simply mark it as stateful=True.

Warning: your model has layers that act on the length dimension !!

The Flatten layer transforms the length dimension into a feature dimension. This will completely prevent you from achieving your goal. If the Flatten layer is expecting 7 steps, you will always need 7 steps.

So, before applying my answer below, fix your model to not use the Flatten layer. Instead, just set return_sequences=False on the last LSTM layer, so that it outputs only its final step.
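To see why Flatten pins the number of steps, here is a small NumPy sketch. The random arrays are only stand-ins for the output of the last LSTM layer (16 units, as in the model above), and the two helper functions are illustrative names, not Keras API.

```python
import numpy as np

units = 16  # size of the last LSTM layer in the model above


def flattened_features(lstm_output):
    # What Flatten does: merge (steps, units) into a single feature axis.
    return lstm_output.reshape(lstm_output.shape[0], -1)


def last_step_features(lstm_output):
    # What return_sequences=False yields: only the final step's output.
    return lstm_output[:, -1, :]


seven_steps = np.random.rand(1, 7, units)  # stand-in for LSTM output on 7 steps
one_step = np.random.rand(1, 1, units)     # stand-in for LSTM output on 1 step

print(flattened_features(seven_steps).shape)  # (1, 112)
print(flattened_features(one_step).shape)     # (1, 16)
print(last_step_features(seven_steps).shape)  # (1, 16)
print(last_step_features(one_step).shape)     # (1, 16)
```

A Dense layer whose weights were trained for 112 inputs cannot accept 16, while the last-step output has the same size no matter how many steps were fed.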

The following code fixes that and also prepares a few things to be used with the answer below:

def createModel(forTraining):

    #model for training, stateful=False, any batch size   
    if forTraining == True:
        batchSize = None
        stateful = False

    #model for predicting, stateful=True, fixed batch size
    else:
        batchSize = 1
        stateful = True

    model = Sequential()

    first_lstm = LSTM(32, 
        #steps dimension is None so the same layer accepts 7 steps or 1 step
        batch_input_shape=(batchSize, None, num_features), 
        return_sequences=True, activation='tanh', 
        stateful=stateful)   

    model.add(first_lstm)
    model.add(LeakyReLU())
    model.add(Dropout(0.2))

    #this is the last LSTM layer, use return_sequences=False
    model.add(LSTM(16, return_sequences=False, stateful=stateful,  activation='tanh'))

    model.add(Dropout(0.2))
    model.add(LeakyReLU())

    #don't add a Flatten!!!
    #model.add(Flatten())

    model.add(Dense(1, activation='sigmoid'))

    if forTraining == True:
        compileThisModel(model)

    return model

With this, you will be able to train with 7 steps and predict with one step. Otherwise it will not be possible.

The usage of a stateful model as a solution for your question

First, train this new model again, because it has no Flatten layer:

trainingModel = createModel(forTraining=True)
trainThisModel(trainingModel)

Now, with this trained model, you can simply create a new model exactly the same way you created the trained model, but marking stateful=True in all its LSTM layers. And we should copy the weights from the trained model.

Since these new layers will need a fixed batch size (Keras' rules), I assumed it would be 1 (one single stream is coming, not m streams) and added it to the model creation above.

predictingModel = createModel(forTraining=False)
predictingModel.set_weights(trainingModel.get_weights())

And voilà. Just predict the outputs of the model with a single step:

#`incomingSamples` is a placeholder for however samples reach your model
for sample in incomingSamples:
    #sample.shape == (1, 1, 3): one stream, one step, three features
    prob = predictingModel.predict_on_batch(sample)

When you decide that you have reached the end of what you consider a continuous sequence, call predictingModel.reset_states() so you can safely start a new sequence without the model treating it as a continuation of the previous one.
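The reason this works can be sketched with a hand-written LSTM cell in NumPy. This is not Keras code: the weights are random stand-ins for trained ones, and the gate order (input, forget, candidate, output) is an assumption made for the sketch. It checks that feeding one step per call while carrying the state gives the same result as feeding all 7 steps at once.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_units = 3, 4

# Random stand-ins for trained weights (hypothetical values).
W = rng.normal(size=(n_features, 4 * n_units))
U = rng.normal(size=(n_units, 4 * n_units))
b = np.zeros(4 * n_units)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, h, c):
    # One LSTM time step: x is (n_features,), h and c are (n_units,).
    z = x @ W + h @ U + b
    i, f, g, o = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c


def run(sequence, h, c):
    # Feed the steps of `sequence` in order, starting from the given state.
    for x in sequence:
        h, c = lstm_step(x, h, c)
    return h, c


sequence = rng.normal(size=(7, n_features))
zeros = np.zeros(n_units)

# All 7 steps in one call (what the training model sees).
h_full, c_full = run(sequence, zeros, zeros)

# One step per call, carrying the state between calls (the stateful case).
h_stream, c_stream = zeros, zeros
for x in sequence:
    h_stream, c_stream = run(x[None, :], h_stream, c_stream)

same = np.allclose(h_stream, h_full) and np.allclose(c_stream, c_full)
print(same)
```

Resetting the carried state to zeros before the next sequence plays the role of reset_states() in this sketch.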

Saving and loading states

Just get and set them, saving with h5py:

def saveStates(model, saveName):

    f = h5py.File(saveName,'w')

    for l, lay in enumerate(model.layers):
        #if you have nested models, 
            #consider making this recursive, testing for layers in layers
        if isinstance(lay,RNN):
            for s, stat in enumerate(lay.states):
                f.create_dataset('states_' + str(l) + '_' + str(s),
                                 data=K.eval(stat), 
                                 dtype=K.dtype(stat))

    f.close()


def loadStates(model, saveName):

    f = h5py.File(saveName, 'r')
    allStates = list(f.keys())

    for stateKey in allStates:
        name, layer, state = stateKey.split('_')
        layer = int(layer)
        state = int(state)

        K.set_value(model.layers[layer].states[state], f.get(stateKey))

    f.close()
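If you would rather not depend on h5py, the same key scheme can be sketched with NumPy's own .npz format. This is a swapped-in alternative, not the method above: the function names are hypothetical, and the state arrays below are made-up stand-ins for what K.eval(stat) would return (layer indices 0 and 3 are only illustrative).

```python
import os
import tempfile

import numpy as np


def save_state_arrays(states, path):
    # `states` maps (layer_index, state_index) -> array, mirroring the
    # 'states_<layer>_<state>' dataset names used by saveStates above.
    np.savez(path, **{'states_%d_%d' % key: value for key, value in states.items()})


def load_state_arrays(path):
    restored = {}
    with np.load(path) as archive:
        for name in archive.files:
            _, layer, state = name.split('_')
            restored[(int(layer), int(state))] = archive[name]
    return restored


# Illustrative states: two recurrent layers, each with h and c, batch size 1.
states = {
    (0, 0): np.full((1, 32), 0.5, dtype='float32'),
    (0, 1): np.full((1, 32), -0.5, dtype='float32'),
    (3, 0): np.zeros((1, 16), dtype='float32'),
    (3, 1): np.ones((1, 16), dtype='float32'),
}

path = os.path.join(tempfile.mkdtemp(), 'states.npz')
save_state_arrays(states, path)
restored = load_state_arrays(path)

print(sorted(restored))
print(all(np.array_equal(states[k], restored[k]) for k in states))
```

The restored arrays could then be written back into a model with K.set_value, in the same way loadStates does.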

Working test for saving/loading states

import h5py, numpy as np
from keras.layers import RNN, LSTM, Dense, Input
from keras.models import Model
import keras.backend as K




def createModel():
    inp = Input(batch_shape=(1,None,3))
    out = LSTM(5,return_sequences=True, stateful=True)(inp)
    out = LSTM(2, stateful=True)(out)
    out = Dense(1)(out)
    model = Model(inp,out)
    return model


def saveStates(model, saveName):

    f = h5py.File(saveName,'w')

    for l, lay in enumerate(model.layers):
        #if you have nested models, consider making this recursive, testing for layers in layers
        if isinstance(lay,RNN):
            for s, stat in enumerate(lay.states):
                f.create_dataset('states_' + str(l) + '_' + str(s), data=K.eval(stat), dtype=K.dtype(stat))

    f.close()


def loadStates(model, saveName):

    f = h5py.File(saveName, 'r')
    allStates = list(f.keys())

    for stateKey in allStates:
        name, layer, state = stateKey.split('_')
        layer = int(layer)
        state = int(state)

        K.set_value(model.layers[layer].states[state], f.get(stateKey))

    f.close()

def printStates(model):

    for l in model.layers:
        #if you have nested models, consider making this recursive, testing for layers in layers
        if isinstance(l,RNN):
            for s in l.states:
                print(K.eval(s))   

model1 = createModel()
model2 = createModel()
model1.predict_on_batch(np.ones((1,5,3))) #changes model 1 states

print('model1')
printStates(model1)
print('model2')
printStates(model2)

saveStates(model1,'testStates5')
loadStates(model2,'testStates5')

print('model1')
printStates(model1)
print('model2')
printStates(model2)

Considerations on the aspects of the data

Your first model (if it has stateful=False) treats each of the m sequences as independent and not connected to the others. It also assumes that each batch contains unique sequences.

If this is not the case, you might want to train the stateful model instead (treating each sequence as a continuation of the previous one). You would then need m batches of 1 sequence each, i.e. m batches of shape (1, 7 or None, 3).
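The two data layouts can be sketched with NumPy as follows. The array contents and the value m = 5 are made up for illustration; only the shapes matter.

```python
import numpy as np

m, steps, features = 5, 7, 3
data = np.arange(m * steps * features, dtype='float32').reshape(m, steps, features)

# Non-stateful training: one batch holding m independent sequences.
independent_batch = data  # shape (m, 7, 3)

# Stateful training: m consecutive batches of a single sequence each,
# fed in order so the state carries over from one batch to the next.
consecutive_batches = [data[i:i + 1] for i in range(m)]  # m x (1, 7, 3)

print(independent_batch.shape)
print(len(consecutive_batches), consecutive_batches[0].shape)

# Concatenating the batches along the step axis gives the single long
# sequence that the stateful model effectively sees.
long_sequence = np.concatenate(consecutive_batches, axis=1)
print(long_sequence.shape)  # (1, 35, 3)
```

With the stateful layout, the batches must be fed in order and without shuffling, and the states reset once the long sequence ends.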