Keras Bidirectional LSTM - Layer grouping

Question

While working to implement a paper (Dialogue Act Sequence Labeling using Hierarchical encoder with CRF) using Keras, I need to implement a specific Bidirectional LSTM architecture.

I have to train the network on the concept of a Conversation. Conversations are composed of Utterances, and Utterances are composed of Words. Words are N-dimensional vectors. The model represented in the paper first reduces each Utterance to a single M-dimensional vector. To achieve this, it uses a Bidirectional LSTM layer. Let's call this layer A.

(For simplicity, let's assume that each Utterance has a length of |U| and each Conversation has a length of |C|)

Each Utterance is input to a Bi-LSTM layer with |U| timesteps, and the output of the last timestep is taken. The input size is (|U|, N), and the output size is (1, M).

This Bi-LSTM layer should be applied separately/simultaneously to each Utterance in the Conversation. Note that, since the network takes as input the entire Conversation, the dimensions for a single input to the network would be (|C|, |U|, N).

As the paper describes, I intend to take each utterance (i.e. each (|U|, N) slice) of that input and feed it to a Bi-LSTM layer with |U| units. As there are |C| Utterances in a Conversation, this implies that there should be a total of |C| x |U| Bi-LSTM units, grouped into |C| partitions, one per Utterance. There should be no connection between the |C| groups of units. Once processed, the output of each of those |C| groups of Bidirectional LSTM units will then be fed into another Bi-LSTM layer, say B.

How is it possible to feed specific portions of the input only to specific portions of the layer A, and make sure that they are not interconnected? (i.e. the portion of Bi-LSTM units used for an Utterance should not be connected to the Bi-LSTM units used for another Utterance)

Is it possible to achieve this through keras.models.Sequential, or is there a specific way to achieve this using the Functional API?

Here is what I have tried so far:

# ...
model = Sequential()
model.add(Bidirectional(LSTM(C * U), input_shape = (C, U, N),
                        merge_mode='concat'))
model.add(GlobalMaxPooling1D())
model.add(Bidirectional(LSTM(n, return_sequences = True), merge_mode='concat'))
# ...

model.compile(loss = loss_function,
              optimizer = optimizer,
              metrics=['accuracy'])

However, this code currently raises the following error:

ValueError: Input 0 is incompatible with layer bidirectional_1: expected ndim=3, found ndim=4

More importantly, the code above obviously does not do the grouping I mentioned. I am looking for a way to enhance the model as I described above.

Finally, below is the figure of the model I described above. It may help clarify the description. The layer tagged as "Utterance layer" is what I called layer A. As you can see in the figure, each utterance u_i is composed of words w_j, which are N-dimensional vectors. (You may omit the embedding layer for the purposes of this question.) Assuming, for simplicity, that each u_i has an equal number of Words, each group of Bidirectional LSTM nodes in the Utterance Layer will have an input size of (|U|, N). Yet, since there are |C| such utterances u_i in a Conversation, the dimensions of the entire input will be (|C|, |U|, N).

Is the input intentionally rank 4? A plain Keras LSTM doesn't support that. — Maxim
@Maxim That doesn't necessarily apply to my question. If you read all of the admittedly lengthy description of the problem, the issue was never about batch size. A batch in my case would consist of multiple Conversations. So, (|C|, |U|, N) is exactly a single input from a batch of multiple such Conversations. If we call the batch size |B|, then I would have to say batch_input_shape=(|B|, |C|, |U|, N). It is another way to declare what I am already declaring, but it is irrelevant to the core of my problem. — ilim
This part of your question is "very" obscure: "As the paper describes, I intend to group each utterance (i.e. each (|U|, N)) of that input and feed it to a Bi-LSTM layer with |U| units. This means that there should be |C| x |U| different Bi-LSTM units, grouped into |C| different partitions for each Utterance. The output of each of those C groups will then be fed into another Bi-LSTM layer, say B." -- What do you mean by "having |U| units"? Do you mean steps in the output? And then |C| x |U| units? Your picture doesn't seem to show any of this. — Daniel Möller
@DanielMöller You are right, I guess. I edited that paragraph of my question to clarify it a bit. Also, I added some more explanation to the last paragraph. (i.e. the one where I describe the figure and how it corresponds to my question) Hope it helps clarify things a bit. — ilim
I think there is a misunderstanding about how it works.... there is no need or implication to have "C x U" units in the model. Unless you have a clear reason for that, other than creating what is in the picture, it's not necessary at all. — Daniel Möller

Daniel Möller · Accepted Answer · 2018-04-16T13:10:54

I'll create a net for what I see in the picture. For now I'm ignoring the "units" part I mentioned in my comment to your question.

This model does exactly what is shown in the picture. All utterances are completely separate from start to end.

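from keras.models import Sequential
from keras.layers import TimeDistributed, Embedding, Bidirectional, LSTM, Dense

model = Sequential()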

#You have an extra time dimension that should be kept as is
#So we add a `TimeDistributed` wrapper to the first layers

model.add(TimeDistributed(Embedding(dictionaryLength,N), input_shape=(C,U)))

#This is the utterance layer. It works in "word steps", keeping "utterance steps" untouched    
model.add(TimeDistributed(Bidirectional(LSTM(M//2, return_sequences=False))))

#Is the pooling really demanded by the article?
#Or was it an attempt to remove one of the time dimensions?
#Not adding it here because I used `return_sequences=False` 


model.add(Bidirectional(LSTM(someSize//2,return_sequences=True)))
model.add(Dense(anotherSize)) #is this a CRF layer???

model.summary()
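
The question also asks about the Functional API. The same architecture can be written in functional style; what follows is only a minimal sketch, assuming the Keras bundled with TensorFlow (`tensorflow.keras`) rather than the standalone package. The concrete sizes and the names VOCAB_SIZE, HIDDEN and NUM_TAGS are illustrative placeholders (they stand in for dictionaryLength, someSize and anotherSize), and the final softmax Dense is just a stand-in for the CRF layer of the paper.

```python
from tensorflow.keras import Model
from tensorflow.keras.layers import (
    Input, Embedding, TimeDistributed, Bidirectional, LSTM, Dense,
)

# Illustrative sizes only (assumptions, not values from the paper).
C, U = 5, 7               # utterances per conversation, words per utterance
VOCAB_SIZE = 100          # plays the role of dictionaryLength
N, M = 16, 8              # word-vector size, utterance-vector size (M even)
HIDDEN, NUM_TAGS = 12, 4  # stand-ins for someSize and anotherSize

# One sample is a conversation of word indices with shape (C, U).
words = Input(shape=(C, U), dtype="int32")

# Embed every word: (batch, C, U) -> (batch, C, U, N).
x = TimeDistributed(Embedding(VOCAB_SIZE, N))(words)

# Utterance layer: one shared Bi-LSTM applied to each utterance on its own.
# (batch, C, U, N) -> (batch, C, M)
x = TimeDistributed(Bidirectional(LSTM(M // 2)))(x)

# Conversation layer: a Bi-LSTM over the C utterance vectors.
x = Bidirectional(LSTM(HIDDEN // 2, return_sequences=True))(x)

# Per-utterance tag scores; a softmax Dense stands in for the CRF here.
tags = Dense(NUM_TAGS, activation="softmax")(x)

model = Model(inputs=words, outputs=tags)
model.summary()
```

Sequential and functional styles build the same graph here; the functional form mainly helps if the model later needs extra inputs or branches.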

Notice that in every Bidirectional layer, I divided the output size by two, so it's important that M and someSize are even numbers.
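
On the claim that utterances stay completely separate: TimeDistributed applies one shared layer to every utterance, which conceptually amounts to folding the utterance axis into the batch axis. The NumPy sketch below illustrates the idea with a toy encoder (a mean over words followed by a projection) standing in for the Bi-LSTM. The sizes, the encoder and the function names are all hypothetical, and Keras's actual implementation may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(0)
B, C, U, N, M = 2, 3, 4, 5, 6  # illustrative sizes (assumptions)

# A single weight matrix shared by all utterances: no per-utterance units.
W = rng.normal(size=(N, M))


def encode_utterances(x):
    """Toy stand-in for the Bi-LSTM: (num_utterances, U, N) -> (num_utterances, M)."""
    return np.tanh(x.mean(axis=1) @ W)


def time_distributed(encoder, conversations):
    """Apply `encoder` to every utterance separately."""
    b, c, u, n = conversations.shape
    folded = conversations.reshape(b * c, u, n)  # utterances become batch items
    encoded = encoder(folded)                    # (b * c, M)
    return encoded.reshape(b, c, -1)             # back to (b, c, M)


batch = rng.normal(size=(B, C, U, N))
out = time_distributed(encode_utterances, batch)

# Change only utterance 0 of conversation 0 and encode again.
changed = batch.copy()
changed[0, 0] += 1.0
out_changed = time_distributed(encode_utterances, changed)
```

Here `out` has shape (B, C, M), and only the vector for the modified utterance differs between `out` and `out_changed`. Every other utterance vector is unchanged, even though there is a single set of weights rather than |C| x |U| units.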
