
I'm following TensorFlow's Neural Machine Translation with Attention tutorial (link) but am unclear about some implementation details. It'd be great if someone could help clarify or refer me to a source/better place to ask:

1) def loss_function(real, pred): This function computes the loss at a specific time step (say t), averaged over the entire batch. Examples whose label at t is <pad> (i.e. no real data, only padding so that all example sequences have the same length) are masked so they don't count towards the loss.

My question: It seems the loss should get smaller as t grows (since more examples are <pad> the closer we get to the maximum sequence length). So why is the loss averaged over the entire batch, and not just over the number of valid (non-<pad>) examples? (This is analogous to using tf.losses.Reduction.SUM_BY_NONZERO_WEIGHTS instead of tf.losses.Reduction.SUM_OVER_BATCH_SIZE.)
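For reference, the tutorial's loss_function looks roughly like the sketch below (condensed; loss_object is the SparseCategoricalCrossentropy with reduction='none' defined in the tutorial, and <pad> is assumed to be token id 0). The commented-out return shows the alternative reduction I'm suggesting:

    import tensorflow as tf

    loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction='none')

    def loss_function(real, pred):
        # zero out positions whose target token is <pad> (assumed id 0)
        mask = tf.cast(tf.math.not_equal(real, 0), tf.float32)
        per_example = loss_object(real, pred) * mask

        # tutorial: divide by the full batch size (SUM_OVER_BATCH_SIZE)
        return tf.reduce_mean(per_example)
        # suggested alternative: divide by the number of non-<pad> targets
        # (SUM_BY_NONZERO_WEIGHTS)
        # return tf.reduce_sum(per_example) / tf.reduce_sum(mask)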

2) In the training loop (for epoch in range(EPOCHS)), two loss variables are defined:

  • loss = sum of loss_function() outputs over all time steps
  • batch_loss = loss divided by number of time steps

My question: Why are gradients computed w.r.t. loss and not batch_loss? Shouldn't batch_loss be the average loss over all time steps and the entire batch?
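For context, here is a condensed sketch of the tutorial's training step (paraphrased; encoder, decoder, optimizer, targ_lang and BATCH_SIZE are defined in the tutorial). Note that tape.gradient is called on loss, while batch_loss is only returned:

    @tf.function
    def train_step(inp, targ, enc_hidden):
        loss = 0
        with tf.GradientTape() as tape:
            enc_output, enc_hidden = encoder(inp, enc_hidden)
            dec_hidden = enc_hidden
            # every target sequence starts with the <start> token
            dec_input = tf.expand_dims(
                [targ_lang.word_index['<start>']] * BATCH_SIZE, 1)
            for t in range(1, targ.shape[1]):
                predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)
                loss += loss_function(targ[:, t], predictions)  # sum over time steps
                dec_input = tf.expand_dims(targ[:, t], 1)       # teacher forcing
        batch_loss = loss / int(targ.shape[1])                  # average over time steps
        variables = encoder.trainable_variables + decoder.trainable_variables
        gradients = tape.gradient(loss, variables)              # w.r.t. loss, not batch_loss
        optimizer.apply_gradients(zip(gradients, variables))
        return batch_loss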

Many thanks!

It seems to me you are correct and that it's more appropriate to calculate the loss the way you suggested. - Ohad Rubin

1 Answer


"It seems the loss should get smaller as t grows"

The loss at later time steps does get smaller, since the <pad> tokens are masked out when the loss is calculated.
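As a toy illustration (hypothetical numbers), suppose at a late time step only 2 of the 4 sequences in the batch still have real tokens:

    import tensorflow as tf

    # per-example losses after masking: the two <pad> positions contribute 0
    masked_loss = tf.constant([0.9, 1.1, 0.0, 0.0])
    mask        = tf.constant([1.0, 1.0, 0.0, 0.0])

    print(tf.reduce_mean(masked_loss).numpy())                         # 0.5, averaged over the batch
    print((tf.reduce_sum(masked_loss) / tf.reduce_sum(mask)).numpy())  # 1.0, averaged over valid tokens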

batch_loss is used only to print the loss of each batch; it is calculated for every batch, across all the time steps.

for t in range(1, targ.shape[1]):

This loop runs over all the time steps of the batch and accumulates the loss, masking the padded values at each step.

I hope this clears it up :)