1
votes

In the official Tensorflow Neural Machine Translation example (https://www.tensorflow.org/alpha/tutorials/text/nmt_with_attention), in the Encoder model, a GRU layer is defined.

However, the zero-padded values will be processed normally by the GRU as there is no masking applied. And in the Decoder I think that the situation is even worse, because the Attention over the padded values will play an important role in the final computation of the context vector. I think that in the definition of the loss function below, the zeroes are masked, but at this point it is too late and the outputs of both the encoder and the attention decoder will be "broken".

Am I missing something in the whole process? Shouldn't the normal way of implementing this be with masking the padded values?

1

1 Answers

0
votes

You are right, you can see that when you print the tensor returned from the encoder that the numbers on the right side of the differ although most of it comes from the padding: enter image description here

Usual implementation indeed includes masking. You would then use the mask in computing the attention weights in the next cell. The simplest way is adding something like to (1 - mask) * 1e9 to the attention logits in the score tensor. The tutorial is a very basic one. For instance, the text prepreprocessing is very simple (remove all non-ASCII characters), or the tokenization differs from what is usual in machine translation.