I'm following TensorFlow's Neural Machine Translation with Attention tutorial (link) but am unclear about some implementation details. It'd be great if someone could help clarify, or refer me to a source or a better place to ask:
1) def loss_function(real, pred): This function computes the loss at a specific time step (say t), averaged over the entire batch (sketched below). Examples whose label at t is <pad> (i.e. no real data, only padding so that all example sequences have the same length) are masked so they don't count towards the loss.
My question: It seems the loss should get smaller the larger t is (since more labels are <pad> the closer t gets to the maximum sequence length). So why is the loss averaged over the entire batch, and not just over the number of valid (non-<pad>) examples? (Averaging over valid examples would be analogous to using tf.losses.Reduction.SUM_BY_NONZERO_WEIGHTS instead of tf.losses.Reduction.SUM_OVER_BATCH_SIZE.)
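For concreteness, here is a minimal sketch of the masked loss roughly as the tutorial defines it (assuming token id 0 is <pad>), plus the alternative I have in mind; loss_function_nonzero is my own name, not the tutorial's:

```python
import tensorflow as tf

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
    # real: (batch,) target token ids at step t; pred: (batch, vocab) logits
    mask = tf.cast(tf.math.logical_not(tf.math.equal(real, 0)), tf.float32)
    loss_ = loss_object(real, pred) * mask   # zero out loss on <pad> positions
    # Tutorial-style reduction: divide by the full batch size,
    # i.e. SUM_OVER_BATCH_SIZE behaviour.
    return tf.reduce_mean(loss_)

def loss_function_nonzero(real, pred):
    # Alternative: divide by the number of valid (non-<pad>) examples,
    # i.e. SUM_BY_NONZERO_WEIGHTS behaviour.
    mask = tf.cast(tf.math.logical_not(tf.math.equal(real, 0)), tf.float32)
    loss_ = loss_object(real, pred) * mask
    return tf.reduce_sum(loss_) / tf.maximum(tf.reduce_sum(mask), 1.0)

real = tf.constant([5, 3, 0, 0])                  # last two labels are <pad>
pred = tf.random.uniform((4, 10))                 # logits for a 10-word vocab
print(loss_function(real, pred).numpy())          # sum of losses / 4
print(loss_function_nonzero(real, pred).numpy())  # sum of losses / 2
```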
2) for epoch in range(EPOCHS): Two loss variables are defined in the training loop:
- loss = the sum of loss_function() outputs over all time steps
- batch_loss = loss divided by the number of time steps
My question: Why are gradients computed w.r.t. loss and not batch_loss? Shouldn't batch_loss be the average loss over all time steps and the entire batch?
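To make this concrete, here is a toy sketch (the names w and T are mine, not the tutorial's). Since batch_loss = loss / T with T a constant, the two gradients differ only by the factor 1/T, so I'd expect the choice to amount to a learning-rate rescale:

```python
import tensorflow as tf

T = 10                    # number of decoder time steps (hypothetical)
w = tf.Variable(2.0)      # stand-in for a model weight

with tf.GradientTape(persistent=True) as tape:
    # stand-in for summing loss_function() outputs over time steps 1..T-1
    loss = tf.add_n([w * float(t) for t in range(1, T)])
    batch_loss = loss / T

g_loss = tape.gradient(loss, w)          # 1 + 2 + ... + 9 = 45.0
g_batch = tape.gradient(batch_loss, w)   # 45.0 / 10 = 4.5
del tape
print(g_loss.numpy(), g_batch.numpy())   # same direction, scaled by 1/T
```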
Many thanks!