In the official Tensorflow Neural Machine Translation example (https://www.tensorflow.org/alpha/tutorials/text/nmt_with_attention), in the Encoder model, a GRU layer is defined.
However, the zero-padded values will be processed normally by the GRU as there is no masking applied. And in the Decoder I think that the situation is even worse, because the Attention over the padded values will play an important role in the final computation of the context vector. I think that in the definition of the loss function below, the zeroes are masked, but at this point it is too late and the outputs of both the encoder and the attention decoder will be "broken".
Am I missing something in the whole process? Shouldn't the normal way of implementing this be with masking the padded values?
