I have sequence data that tells me what color was observed for multiple subjects at different points in time. For example:

ID  Time  Color
A   1     Blue
A   2     Red
A   5     Red
B   3     Blue
B   6     Green
C   1     Red
C   3     Orange

I want to obtain predictions for the most likely color over the next 3 time steps, along with the probability of that color appearing. For example, for ID A, I'd like to know the next 3 (time, color) items in the sequence, as well as the probability of each color appearing.

I understand that LSTMs are often used to predict this type of sequential data, and that I would feed in a 3d array like

input = [
    [[1, 1], [2, 2], [5, 2]],  # blue at t=1, red at t=2, red at t=5 for ID A
    [[0, 0], [3, 1], [6, 3]],  # nothing for first entry, blue at t=3, green at t=6 for ID B
    [[0, 0], [1, 2], [3, 4]],  # nothing for first entry, red at t=1, orange at t=3 for ID C
]

after mapping the colors to numbers (Blue -> 1, Red -> 2, Green -> 3, Orange -> 4, etc.). My understanding is that, by default, the LSTM just predicts the next item in each sequence, so for example

output = [
    [[7, 2]],  # next item is most likely red at t=7
    [[9, 3]],  # next item is most likely green at t=9
    [[6, 2]],  # next item is most likely red at t=6
]

Is it possible to modify the output of my LSTM so that instead of just predicting the next occurrence time and color, I can get the next 3 times, colors AND probabilities of the colors appearing? For example, an output like

output = [
    [[7, 2, 0.93], [8, 2, 0.79], [10, 4, 0.67]],
    [[9, 2, 0.88], [11, 3, 0.70], [14, 3, 0.43]],
    ...
]

I've tried looking through the Keras Sequential documentation, but I haven't found anything that addresses this.

Furthermore, I see that there's typically a TrainX and TrainY used for model.fit(), but I'm not sure what my TrainY would be here.
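For reference, here's roughly how I'm building that padded input array from the table above (using pandas; the [0, 0] padding for missing entries is just my guess):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID":    ["A", "A", "A", "B", "B", "C", "C"],
    "Time":  [1, 2, 5, 3, 6, 1, 3],
    "Color": ["Blue", "Red", "Red", "Blue", "Green", "Red", "Orange"],
})
df["Color"] = df["Color"].map({"Blue": 1, "Red": 2, "Green": 3, "Orange": 4})

# Left-pad each subject's (time, color) sequence with [0, 0] rows
max_len = df.groupby("ID").size().max()
sequences = []
for _, group in df.groupby("ID"):
    pairs = group[["Time", "Color"]].to_numpy().tolist()
    sequences.append([[0, 0]] * (max_len - len(pairs)) + pairs)
inputs = np.array(sequences)  # shape (3, 3, 2), matching the array above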

Comment: Sequential is unrelated to sequences; it is just an interface to stack layers (a better name would have been Model). – runDOSrun

1 Answer


LSTM just predicts the next item in each sequence...

Not really. LSTM is just a layer that helps encode sequential data; it's the downstream task (the dense and output layers) that determines what the model is going to "predict".
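For example, here's a minimal sketch (layer sizes are illustrative, not prescriptive) where an LSTM encodes the (time, color) steps and a Dense softmax head is what actually produces a probability for each color — which is where the probabilities you asked about come from:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

n_colors = 5  # 4 colors + 1 reserved for padding (assumption)

model = Sequential([
    LSTM(32, input_shape=(None, 2)),        # encodes the (time, color) steps
    Dense(n_colors, activation="softmax"),  # probability for each color
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")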

While you can train an LSTM-based model to predict the next value in a sequence (by cleverly keeping the last time step as your regression target, y), you would ideally want to use an LSTM-based encoder-decoder architecture to properly generate sequences from input sequences.

[Figure: LSTM encoder-decoder (seq2seq) architecture]

This is the same architecture that is used for language models to generate text or machine translation models to translate English to French.

You can find a good tutorial on implementing this here. The advantage of this model is that you can choose to decode as many time steps as you need. So for your case, you can feed a padded, fixed-length sequence of colors to the encoder and decode 3 time steps.
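A rough sketch of such an encoder-decoder in Keras (this follows the general pattern, not the tutorial's exact code; sizes and token counts are assumptions):

from tensorflow.keras.layers import Dense, Embedding, Input, LSTM
from tensorflow.keras.models import Model

n_tokens = 7    # 4 colors + pad + <start> + <end> (assumption)
latent_dim = 64

# Encoder: reads the padded input color sequence and summarizes it in its state
enc_in = Input(shape=(None,))
enc_emb = Embedding(n_tokens, 16, mask_zero=True)(enc_in)
_, state_h, state_c = LSTM(latent_dim, return_state=True)(enc_emb)

# Decoder: generates the next colors, conditioned on the encoder state
dec_in = Input(shape=(None,))
dec_emb = Embedding(n_tokens, 16, mask_zero=True)(dec_in)
dec_out = LSTM(latent_dim, return_sequences=True)(dec_emb, initial_state=[state_h, state_c])
probs = Dense(n_tokens, activation="softmax")(dec_out)  # per-step color probabilities

model = Model([enc_in, dec_in], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")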

From a data-prep point of view, you will have to take each sequence of colors, remove the last 3 colors as your y, and pad the rest to a fixed length:

sample = [R, G, B, B, R, G, R, R, B]
X = [<start>, 0, 0, 0, 0, 0, R, G, B, B, R, G, <end>]  #Padded input sequence
y = [<start>, R, R, B, <end>]                          #Y sequence
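In code, that preprocessing could look roughly like this (assuming the integer encoding from the question, 0 as the pad value, and 5/6 as made-up <start>/<end> tokens):

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

PAD, START, END = 0, 5, 6              # special tokens (assumption)
sample = [2, 3, 1, 1, 2, 3, 2, 2, 1]   # R, G, B, B, R, G, R, R, B as integers

x_raw, y_raw = sample[:-3], sample[-3:]                  # last 3 colors become y
X = pad_sequences([[START] + x_raw + [END]], maxlen=13)  # note: pads in front of <start>
y = np.array([[START] + y_raw + [END]])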

You will find the necessary preprocessing, training, and inference steps in the link above.
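At inference time, the decoding loop is where you read off the probabilities you asked about. A greedy-decoding sketch under the same assumptions (it re-feeds the growing decoder sequence to the training model, which is simpler but slower than building a dedicated inference model as the tutorial does):

import numpy as np

def decode_next_colors(model, x, n_steps=3, start_token=5):
    """Greedily decode n_steps colors, returning (color, probability) pairs."""
    results = []
    dec_seq = [start_token]
    for _ in range(n_steps):
        # Probability distribution over tokens at the latest decoder step
        probs = model.predict([x, np.array([dec_seq])], verbose=0)[0, -1]
        color = int(np.argmax(probs))           # most likely next color
        results.append((color, float(probs[color])))
        dec_seq.append(color)                   # feed the prediction back in
    return results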