I'm following TensorFlow's Neural Machine Translation with Attention tutorial (link) but am unclear about some implementation details. It'd be great if someone could help clarify, or refer me to a source or a better place to ask:
1) def loss_function(real, pred): This function computes the loss at a specific time step (say t), averaged over the entire batch (sketched below). Examples whose label at t is <pad> (i.e. no real data, only padding so that all example sequences have the same length) are masked so they don't count towards the loss.
My question: It seems the loss should get smaller the larger t is (since more labels are <pad> the closer t gets to the maximum sequence length). So why is the loss averaged over the entire batch, and not just over the number of valid (non-<pad>) examples? (Averaging over valid examples would be analogous to using tf.losses.Reduction.SUM_BY_NONZERO_WEIGHTS instead of tf.losses.Reduction.SUM_OVER_BATCH_SIZE.)
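For concreteness, here is a minimal sketch of the masked loss roughly as the tutorial defines it (assuming token id 0 is <pad>), plus the alternative I have in mind; loss_function_nonzero is my own name, not the tutorial's:

```python
import tensorflow as tf

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
    # real: (batch,) target token ids at step t; pred: (batch, vocab) logits
    mask = tf.cast(tf.math.logical_not(tf.math.equal(real, 0)), tf.float32)
    loss_ = loss_object(real, pred) * mask   # zero out loss on <pad> positions
    # Tutorial-style reduction: divide by the full batch size,
    # i.e. SUM_OVER_BATCH_SIZE behaviour.
    return tf.reduce_mean(loss_)

def loss_function_nonzero(real, pred):
    # Alternative: divide by the number of valid (non-<pad>) examples,
    # i.e. SUM_BY_NONZERO_WEIGHTS behaviour.
    mask = tf.cast(tf.math.logical_not(tf.math.equal(real, 0)), tf.float32)
    loss_ = loss_object(real, pred) * mask
    return tf.reduce_sum(loss_) / tf.maximum(tf.reduce_sum(mask), 1.0)

real = tf.constant([5, 3, 0, 0])                  # last two labels are <pad>
pred = tf.random.uniform((4, 10))                 # logits for a 10-word vocab
print(loss_function(real, pred).numpy())          # sum of losses / 4
print(loss_function_nonzero(real, pred).numpy())  # sum of losses / 2
```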
2) for epoch in range(EPOCHS): Two loss variables are defined in the training loop:
- loss = the sum of loss_function() outputs over all time steps
- batch_loss = loss divided by the number of time steps
My question: Why are gradients computed w.r.t. loss and not batch_loss? Shouldn't batch_loss be the average loss over all time steps and the entire batch?
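To make this concrete, here is a toy sketch (the names w and T are mine, not the tutorial's). Since batch_loss = loss / T with T a constant, the two gradients differ only by the factor 1/T, so I'd expect the choice to amount to a learning-rate rescale:

```python
import tensorflow as tf

T = 10                    # number of decoder time steps (hypothetical)
w = tf.Variable(2.0)      # stand-in for a model weight

with tf.GradientTape(persistent=True) as tape:
    # stand-in for summing loss_function() outputs over time steps 1..T-1
    loss = tf.add_n([w * float(t) for t in range(1, T)])
    batch_loss = loss / T

g_loss = tape.gradient(loss, w)          # 1 + 2 + ... + 9 = 45.0
g_batch = tape.gradient(batch_loss, w)   # 45.0 / 10 = 4.5
del tape
print(g_loss.numpy(), g_batch.numpy())   # same direction, scaled by 1/T
```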
Many thanks!