If I understood correctly, you have batches of m sequences, each of length 7, whose elements are 3-dimensional vectors (so each batch has shape (m, 7, 3)).
In any Keras RNN you can set the return_sequences flag to True to obtain the intermediate states, i.e., for every batch, instead of only the final prediction, you will get the corresponding 7 outputs, where output i represents the prediction at stage i given all inputs from 0 to i.
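To make the meaning of return_sequences concrete, here is a minimal NumPy sketch of a toy tanh RNN. It is not the Keras implementation: the function toy_rnn, the weights W and U, and the sizes are made up for illustration. It only shows that the per-step outputs have shape (m, 7, hidden), that the last one equals the usual single output, and that output i depends on inputs 0 to i only.

```python
import numpy as np

rng = np.random.RandomState(0)
m, seq_len, in_dims, hid_dims = 4, 7, 3, 5

# Hypothetical weights of a toy tanh RNN (illustration only, not Keras internals)
W = rng.randn(in_dims, hid_dims) * 0.1
U = rng.randn(hid_dims, hid_dims) * 0.1

def toy_rnn(batch, return_sequences=False):
    """Run the toy RNN over a batch of shape (m, steps, in_dims)."""
    h = np.zeros((batch.shape[0], hid_dims))
    outputs = []
    for t in range(batch.shape[1]):
        # the state after step t depends on inputs 0..t only
        h = np.tanh(batch[:, t].dot(W) + h.dot(U))
        outputs.append(h)
    if return_sequences:
        return np.stack(outputs, axis=1)  # shape (m, steps, hid_dims)
    return h                              # shape (m, hid_dims): last step only

batch = rng.randn(m, seq_len, in_dims)
all_steps = toy_rnn(batch, return_sequences=True)
last_step = toy_rnn(batch, return_sequences=False)
print(all_steps.shape)
print(last_step.shape)
```

Truncating the input to its first 3 steps gives the same values as all_steps[:, 2], which is what "prediction at stage i given all inputs from 0 to i" means here.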
But you would get them all at once at the end. As far as I know, Keras doesn't provide a direct interface for retrieving the intermediate outputs whilst the batch is being processed. This may be even more constrained if you are using any of the CuDNN-optimized variants. What you can do is regard your batch as 7 successive batches of shape (m, 1, 3) and feed them progressively to your LSTM, recording the hidden state and prediction at each step. For that, you can either set return_state to True and do it manually, or you can simply set stateful to True and let the object keep track of it.
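The idea of feeding 7 successive slices while carrying the state can be checked without Keras. The following is a rough NumPy sketch of a plain LSTM, assuming sigmoid gates, a tanh activation and random weights; lstm_forward and all variable names are mine, not a Keras API. It shows that feeding slices of shape (m, 1, 3) and passing (h, c) along gives the same final result as processing the whole sequence at once, while exposing every intermediate output.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(x, h, c, W, U, b):
    """Run a plain LSTM over x of shape (m, steps, in_dims), starting from (h, c)."""
    units = h.shape[1]
    for t in range(x.shape[1]):
        z = x[:, t].dot(W) + h.dot(U) + b        # shape (m, 4 * units)
        i = sigmoid(z[:, :units])                # input gate
        f = sigmoid(z[:, units:2 * units])       # forget gate
        g = np.tanh(z[:, 2 * units:3 * units])   # cell candidate
        o = sigmoid(z[:, 3 * units:])            # output gate
        c = f * c + i * g
        h = o * np.tanh(c)
    return h, c

rng = np.random.RandomState(0)
m, seq_len, in_dims, units = 2, 7, 3, 32
W = rng.randn(in_dims, 4 * units) * 0.1   # random stand-ins for trained weights
U = rng.randn(units, 4 * units) * 0.1
b = np.zeros(4 * units)
batch = rng.randn(m, seq_len, in_dims)
zeros = np.zeros((m, units))

# Reference: the whole sequence in one go
h_full, c_full = lstm_forward(batch, zeros, zeros, W, U, b)

# Step-wise: 7 feeds of shape (m, 1, 3), carrying and recording the state
h, c = zeros, zeros
intermediate = []
for t in range(seq_len):
    h, c = lstm_forward(batch[:, t:t + 1], h, c, W, U, b)
    intermediate.append(h)  # output after having seen inputs 0..t

print(np.allclose(h, h_full) and np.allclose(c, c_full))
```

With return_state=True you do this bookkeeping yourself, as in the loop above; with stateful=True the layer stores h and c between calls for you.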
The following Python 2 + Keras example should cover what you want. Specifically:
- saving the whole LSTM intermediate state in a persistent way
- while waiting for the next sample
- and predicting with a model trained on a specific batch size that may be arbitrary and unknown.
For that, it includes an example of stateful=True for easiest training, and return_state=True for most precise inference, so you get a flavor of both approaches. It also assumes that you get a model that has been serialized and about which you don't know much. The structure is closely related to the one in Andrew Ng's course; he is definitely more authoritative than me on the topic. Since you don't specify how the model has been trained, I assumed a many-to-one training setup, but this could be easily adapted.
from __future__ import print_function
from keras.layers import Input, LSTM, Dense
from keras.models import Model, load_model
from keras.optimizers import Adam
import numpy as np
# globals
SEQ_LEN = 7
HID_DIMS = 32
OUTPUT_DIMS = 3 # each prediction is a 3-dimensional vector (see the targets below)
##############################################################################
# define the model to be trained on a fixed batch size:
# assume many-to-one training setup (otherwise set return_sequences=True)
TRAIN_BATCH_SIZE = 20
x_in = Input(batch_shape=[TRAIN_BATCH_SIZE, SEQ_LEN, 3])
lstm = LSTM(HID_DIMS, activation="tanh", return_sequences=False, stateful=True)
dense = Dense(OUTPUT_DIMS, activation='linear')
m_train = Model(inputs=x_in, outputs=dense(lstm(x_in)))
m_train.summary()
# a dummy batch of training data of shape (TRAIN_BATCH_SIZE, SEQ_LEN, 3), with targets of shape (TRAIN_BATCH_SIZE, 3):
batch123 = np.repeat([[1, 2, 3]], SEQ_LEN, axis=0).reshape(1, SEQ_LEN, 3).repeat(TRAIN_BATCH_SIZE, axis=0)
targets = np.repeat([[123,234,345]], TRAIN_BATCH_SIZE, axis=0) # dummy [[1,2,3],,,]-> [123,234,345] mapping to be learned
# train the model on a fixed batch size and save it
print(">> INFERENCE BEFORE TRAINING MODEL:", m_train.predict(batch123, batch_size=TRAIN_BATCH_SIZE, verbose=0))
m_train.compile(optimizer=Adam(lr=0.5), loss='mean_squared_error', metrics=['mae'])
m_train.fit(batch123, targets, epochs=100, batch_size=TRAIN_BATCH_SIZE)
m_train.save("trained_lstm.h5")
print(">> INFERENCE AFTER TRAINING MODEL:", m_train.predict(batch123, batch_size=TRAIN_BATCH_SIZE, verbose=0))
##############################################################################
# Now, although we aren't training anymore, we want to do step-wise predictions
# that do alter the inner state of the model, and keep track of that.
m_trained = load_model("trained_lstm.h5")
print(">> INFERENCE AFTER RELOADING TRAINED MODEL:", m_trained.predict(batch123, batch_size=TRAIN_BATCH_SIZE, verbose=0))
# now define an analogous model that allows a flexible batch size for inference:
x_in = Input(shape=[SEQ_LEN, 3])
h_in = Input(shape=[HID_DIMS])
c_in = Input(shape=[HID_DIMS])
pred_lstm = LSTM(HID_DIMS, activation="tanh", return_sequences=False, return_state=True, name="lstm_infer")
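# with return_state=True the layer returns [output, state_h, state_c]; since
# return_sequences=False, the output (h) and state_h (cc) hold the same values
h, cc, c = pred_lstm(x_in, initial_state=[h_in, c_in])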
prediction = Dense(OUTPUT_DIMS, activation='linear', name="dense_infer")(h)
m_inference = Model(inputs=[x_in, h_in, c_in], outputs=[prediction, h,cc,c])
# Let's confirm that this model is able to load the trained parameters:
# first, check that the performance from scratch is not good:
print(">> INFERENCE BEFORE SWAPPING MODEL:")
predictions, hs, zs, cs = m_inference.predict([batch123,
                                               np.zeros((TRAIN_BATCH_SIZE, HID_DIMS)),
                                               np.zeros((TRAIN_BATCH_SIZE, HID_DIMS))],
                                              batch_size=1)
print(predictions)
# import state from the trained model state and check that it works:
print(">> INFERENCE AFTER SWAPPING MODEL:")
for layer in m_trained.layers:
    if "lstm" in layer.name:
        m_inference.get_layer("lstm_infer").set_weights(layer.get_weights())
    elif "dense" in layer.name:
        m_inference.get_layer("dense_infer").set_weights(layer.get_weights())
predictions, _, _, _ = m_inference.predict([batch123,
                                            np.zeros((TRAIN_BATCH_SIZE, HID_DIMS)),
                                            np.zeros((TRAIN_BATCH_SIZE, HID_DIMS))],
                                           batch_size=1)
print(predictions)
# finally perform granular predictions while keeping the recurrent activations.
# Starting the sequence with zeros is a common practice, but depending on how you
# trained, you might have an <END_OF_SEQUENCE> character that you might want to
# propagate instead.
# The states need as many rows as the current feed (1 sample per step here):
h, c = np.zeros((1, HID_DIMS)), np.zeros((1, HID_DIMS))
for i in range(len(batch123)):
    # about output shape: https://keras.io/layers/recurrent/#rnn
    # h, cc, c hold the network's intermediate results: h is the proper LSTM output,
    # cc is the hidden state (same values as h here, since return_sequences=False)
    # and c is the cell state (the accumulator)
    current_input = batch123[i:i+1] # the length of this feed is arbitrary, doesn't have to be 1
    pred, h, cc, c = m_inference.predict([current_input, h, c])
    print("input:", current_input)
    print("output:", pred)
    print(h.shape, cc.shape, c.shape)
    raw_input("do something with your prediction and hidden state and press Enter to continue")
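The loop above keeps h and c in memory between steps. If the state has to survive while the process waits for the next sample (or restarts), one option is to write it to disk. This is only a sketch of that idea using NumPy's savez: save_state, load_state and the file name lstm_state.npz are hypothetical helpers of mine, not part of Keras.

```python
import os
import tempfile
import numpy as np

HID_DIMS = 32

def save_state(path, h, c):
    """Store the recurrent state (h, c) in a single .npz file."""
    np.savez(path, h=h, c=c)

def load_state(path, batch_size=1, hid_dims=HID_DIMS):
    """Return the stored (h, c), or zeros if nothing has been saved yet."""
    if not os.path.exists(path):
        return np.zeros((batch_size, hid_dims)), np.zeros((batch_size, hid_dims))
    with np.load(path) as data:
        return data["h"], data["c"]

state_file = os.path.join(tempfile.mkdtemp(), "lstm_state.npz")

h, c = load_state(state_file)   # first call: nothing stored, so zeros
# here you would run: pred, h, _, c = m_inference.predict([current_input, h, c])
h, c = h + 1.0, c + 2.0         # stand-in for the updated state
save_state(state_file, h, c)

h_restored, c_restored = load_state(state_file)
print(np.array_equal(h, h_restored) and np.array_equal(c, c_restored))
```

The trained parameters are already persisted by m_train.save, so only the two state arrays need this treatment.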
Additional information:
Since we have two forms of state persistency:
1. The saved/trained parameters of the model, which are the same for each sequence
2. The a, c states, which evolve throughout the sequences and may be "restarted"
it is interesting to take a look at the guts of the LSTM object. In the Python example that I provide, the a and c states (named h and c in the code) are explicitly handled, but the trained parameters aren't, and it may not be obvious how they are implemented internally or what they mean. They can be inspected as follows:
for w in lstm.weights:
    print(w.name, w.shape)
In our case (32 hidden units), this returns the following:
lstm_1/kernel:0 (3, 128)
lstm_1/recurrent_kernel:0 (32, 128)
lstm_1/bias:0 (128,)
We observe a dimensionality of 128. Why is that? This link describes the Keras LSTM implementation as follows:
Here g is the recurrent activation, p is the activation, the Ws are the kernels, the Us are the recurrent kernels, h is the hidden variable, which is also the output, and * denotes element-wise multiplication.
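For reference, the standard LSTM update written in that notation looks as follows. I am assuming the linked description follows this usual formulation; the bias terms b and the subscripts i, f, c, o (input gate, forget gate, cell candidate, output gate) are my labels.

```latex
\begin{aligned}
i_t &= g(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= g(W_f x_t + U_f h_{t-1} + b_f) \\
\tilde{c}_t &= p(W_c x_t + U_c h_{t-1} + b_c) \\
o_t &= g(W_o x_t + U_o h_{t-1} + b_o) \\
c_t &= f_t * c_{t-1} + i_t * \tilde{c}_t \\
h_t &= o_t * p(c_t)
\end{aligned}
```

There are four affine transformations of x_t and h_{t-1}, each producing a vector with one entry per hidden unit.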
This explains the 128 = 32*4: these are the parameters of the affine transformation happening inside each one of the 4 gates, concatenated:
- The matrix of shape (3, 128) (named kernel) handles the input for a given sequence element.
- The matrix of shape (32, 128) (named recurrent_kernel) handles the input from the last recurrent state h.
- The vector of shape (128,) (named bias) is the bias, as usual in any other NN setup.
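To see the 4 blocks explicitly, the concatenated arrays can be sliced into 32-column pieces. The sketch below uses zero-filled stand-ins with the shapes printed above instead of real weights (with a real layer you would take the three arrays from get_weights()). The helper split_gates is hypothetical, and the block order input, forget, candidate, output is how Keras lays them out as far as I know.

```python
import numpy as np

units, in_dims = 32, 3

# Stand-ins with the same shapes as the weights printed above
kernel = np.zeros((in_dims, 4 * units))          # (3, 128)
recurrent_kernel = np.zeros((units, 4 * units))  # (32, 128)
bias = np.zeros(4 * units)                       # (128,)

def split_gates(w, units):
    """Slice the last axis into the 4 per-gate blocks (assumed order: i, f, c, o)."""
    names = ("input", "forget", "candidate", "output")
    return {name: w[..., k * units:(k + 1) * units] for k, name in enumerate(names)}

for name in ("input", "forget", "candidate", "output"):
    print(name,
          split_gates(kernel, units)[name].shape,
          split_gates(recurrent_kernel, units)[name].shape,
          split_gates(bias, units)[name].shape)

# Total number of trainable parameters of the LSTM layer: 3*128 + 32*128 + 128
n_params = kernel.size + recurrent_kernel.size + bias.size
print(n_params)
```

The total should match the parameter count that m_train.summary() reports for the LSTM layer.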