
I am trying to implement a simple many-to-many LSTM for sequence prediction. The problem is easy: the input is a sequence of 0s and 1s, and the output at each time step is the number of ones seen in the sequence up to and including that time step. For example, for the input [0 1 0 1] the output is time0=0, time1=1, time2=1, time3=2. Note that I use one-hot encoding to represent the output.

Assumptions: the length of the input sequence is 20 (so there can be at most 20 ones in the sequence). Therefore, I use 21 output classes (one-hot encoded). Class 0 means there are no ones in the sequence so far, and class 20 (the 21st class) means there are 20 ones.
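To make the encoding concrete, here is a minimal NumPy sketch of how one input/target pair could be built under the assumptions above. The helper name make_example is hypothetical and only for illustration; it is not part of the model code.

```python
import numpy as np

seq_len, num_classes = 20, 21  # 20 time steps, counts 0..20

def make_example(bits):
    """Hypothetical helper: build (input, one-hot target, counts) for one sequence."""
    bits = np.asarray(bits, dtype=int)
    counts = np.cumsum(bits)                          # running count of ones
    x = bits.reshape(-1, 1).astype("float32")         # shape (time, 1), as the LSTM expects
    y = np.eye(num_classes, dtype="float32")[counts]  # shape (time, 21), one-hot rows
    return x, y, counts

x, y, counts = make_example([0, 1, 0, 1])
print(counts.tolist())  # [0, 1, 1, 2]
print(x.shape, y.shape)
```

A full-length example would pass 20 bits instead of 4. The largest possible count is then 20, which is the last valid index among the 21 classes.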

So far, I use the following model:

import tensorflow as tf

# create LSTM
model = tf.keras.models.Sequential()

# return_sequences=True emits one 30-dimensional output per time step
model.add(tf.keras.layers.LSTM(30, input_shape=(20, 1), return_sequences=True))
# Many-to-one variant (only the last output is returned):
# model.add(tf.keras.layers.LSTM(30, input_shape=(20, 1)))
print(model.input_shape)
print(model.output_shape)

# Variant with the explicit wrapper:
# model.add(tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(21, activation='softmax')))
model.add(tf.keras.layers.Dense(21, activation='softmax'))



I evaluated it with and without "tf.keras.layers.TimeDistributed". Both versions reach the same accuracy of 99%! Why is that? When do we actually need "TimeDistributed", and what is it for?


1 Answer


For a Dense layer you don't have to use TimeDistributed, because the kernel is broadcast over the time dimension. For example, with W of shape (30, 21) and x of shape (batch, 20, 30), the multiplication applies the same kernel to every time step of every minibatch entry: (batch, 20, 30) times (30, 21) gives (batch, 20, 21). With these shapes the operation is xW (plus the bias).
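The shape argument can be checked with a small NumPy sketch. This only illustrates the math, not the Keras internals. The random values and the variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, timesteps, units, classes = 4, 20, 30, 21

x = rng.normal(size=(batch, timesteps, units))  # stands in for the LSTM output
W = rng.normal(size=(units, classes))           # stands in for the Dense kernel
b = rng.normal(size=(classes,))                 # stands in for the Dense bias

# Dense on a 3D input: the kernel acts on the last axis and is
# broadcast over the batch and time axes.
broadcast_out = x @ W + b                       # shape (batch, 20, 21)

# The same W and b applied explicitly at each time step, which is
# what wrapping Dense in TimeDistributed amounts to.
per_step_out = np.stack(
    [x[:, t, :] @ W + b for t in range(timesteps)], axis=1
)

print(broadcast_out.shape)
print(np.allclose(broadcast_out, per_step_out))
```

Both computations give the same array, which is consistent with the two models in the question reaching the same accuracy.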

You use TimeDistributed when you have a more complicated layer, or even a whole model, that has to be applied to each time step. Imagine a CNN model that you want to apply to every frame of a video. That is where TimeDistributed is used to its full potential.
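As a rough sketch of the video case, the snippet below wraps a tiny per-frame CNN in TimeDistributed and feeds the per-frame features to an LSTM. The frame size, the layer sizes, and the names frame_cnn and video_batch are assumptions chosen for illustration, not a recommended architecture.

```python
import numpy as np
import tensorflow as tf

# Hypothetical sizes, chosen only to keep the example small.
frames, height, width, channels = 5, 32, 32, 3

# A small CNN that processes a single frame.
frame_cnn = tf.keras.Sequential([
    tf.keras.Input(shape=(height, width, channels)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),   # -> 8 features per frame
])

model = tf.keras.Sequential([
    tf.keras.Input(shape=(frames, height, width, channels)),
    # Apply the same CNN (shared weights) to every frame.
    tf.keras.layers.TimeDistributed(frame_cnn),  # -> (batch, frames, 8)
    tf.keras.layers.LSTM(16),                    # -> (batch, 16)
])

video_batch = np.zeros((2, frames, height, width, channels), dtype="float32")
output = model(video_batch)
print(output.shape)
```

Here the wrapper is needed: Conv2D expects a 4D input (batch, height, width, channels), so unlike Dense it cannot be applied directly to a 5D batch of videos.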