
I'm stepping through the code here: https://www.tensorflow.org/tutorials/text/nmt_with_attention as a learning exercise, and I am confused about when the loss function is called and what is passed to it. I added two print statements to loss_function, and when the training loop runs, it only prints out

(64,) (64, 4935)

multiple times at the very start, and then nothing again. I am confused on two fronts:

  1. Why doesn't loss_function() get called repeatedly throughout the training loop and print the shapes? I expected the loss function to be called at the end of each batch, which is of size 64.
  2. I expected the shape of the actuals to be (batch size, time steps) and the predictions to be (batch size, time steps, vocabulary size). It looks like the loss gets called separately for every time step (64 is the batch size and 4935 is the vocabulary size).

The relevant bits, I believe, are reproduced below.

    optimizer = tf.keras.optimizers.Adam()
    loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')
    
    def loss_function(real, pred):
      mask = tf.math.logical_not(tf.math.equal(real, 0))

      print(real.shape)
      print(pred.shape)

      loss_ = loss_object(real, pred)
      mask = tf.cast(mask, dtype=loss_.dtype)
      loss_ *= mask  # set padding entries to zero loss

      return tf.reduce_mean(loss_)

    @tf.function
    def train_step(inp, targ, enc_hidden):
      loss = 0
    
      with tf.GradientTape() as tape:
        enc_output, enc_hidden = encoder(inp, enc_hidden)
    
        dec_hidden = enc_hidden
    
        dec_input = tf.expand_dims([targ_lang.word_index['<start>']] * BATCH_SIZE, 1)
    
        # Teacher forcing - feeding the target as the next input
        for t in range(1, targ.shape[1]):
          # passing enc_output to the decoder
          predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)
          print(targ[:, t])
          print(predictions)
          loss += loss_function(targ[:, t], predictions)
    
          # using teacher forcing
          dec_input = tf.expand_dims(targ[:, t], 1)
    
      batch_loss = (loss / int(targ.shape[1]))
    
      variables = encoder.trainable_variables + decoder.trainable_variables
    
      gradients = tape.gradient(loss, variables)
    
      optimizer.apply_gradients(zip(gradients, variables))
    
      return batch_loss


    EPOCHS = 10
    
    for epoch in range(EPOCHS):
      start = time.time()
    
      enc_hidden = encoder.initialize_hidden_state()
      total_loss = 0
    
      for (batch, (inp, targ)) in enumerate(dataset.take(steps_per_epoch)):
        #print(batch)    
        batch_loss = train_step(inp, targ, enc_hidden)
        total_loss += batch_loss
    
        if batch % 100 == 0:
          print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                       batch,
                                                       batch_loss.numpy()))
      # saving (checkpoint) the model every 2 epochs
      if (epoch + 1) % 2 == 0:
        checkpoint.save(file_prefix = checkpoint_prefix)
    
      print('Epoch {} Loss {:.4f}'.format(epoch + 1,
                                          total_loss / steps_per_epoch))
      print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))

1 Answer


The loss is treated similarly to the rest of the graph. In TensorFlow, calls like tf.keras.layers.Dense and tf.nn.conv2d don't actually perform the operation; instead, they define the graph for the operations. I have another post here, How do backpropagation works in tensorflow, that explains backprop and some of the motivation for why this is.
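
A quick way to see this tracing behavior for yourself (my own minimal sketch, not from the tutorial): inside a tf.function, a plain Python print runs only while the function is being traced into a graph, while tf.print is compiled into the graph and runs on every call.

import tensorflow as tf

@tf.function
def traced_fn(x):
  print('tracing')       # Python side effect: runs only during tracing
  tf.print('executing')  # graph op: runs every time the function is called
  return x * 2

traced_fn(tf.constant(1))    # prints 'tracing' then 'executing'
traced_fn(tf.constant(2))    # prints only 'executing'
traced_fn(tf.constant(1.0))  # new dtype triggers a retrace: both lines print

This is exactly why your print statements fire a few times at the very start (once per trace) and then go silent.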

The loss function you have above is

def loss_function(real, pred):
  mask = tf.math.logical_not(tf.math.equal(real, 0))

  print(real.shape)
  print(pred.shape)

  loss_ = loss_object(real, pred)
  mask = tf.cast(mask, dtype=loss_.dtype)
  loss_ *= mask  # set padding entries to zero loss

  result = tf.reduce_mean(loss_)
  return result

Think of this function as a generator that returns result. result defines the graph that computes the loss. Perhaps a better name for this function would be loss_function_graph_creator ... but that's another story.

result, which is a graph that contains the weights, biases, and information about how to do both the forward propagation and the back propagation, is all model.fit needs. It no longer needs this function, and it doesn't need to run the function every loop.

Truly, what is happening under the covers is that given your model (called my_model), the compile line

model.compile(loss=loss_function, optimizer='sgd')

is effectively the following lines

# conceptual sketch -- these lines are not runnable as-is
input = tf.keras.Input()
output = my_model(input)
loss = loss_function(target, output)  # the loss compares the target, not the input, to the output
opt = tf.keras.optimizers.SGD()
gradient = opt.minimize(loss)

get_gradient_model = tf.keras.Model(input, gradient)

and there you have the gradient operation, which can be used in a loop to get the gradients, which is conceptually what model.fit does.
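
For comparison, here is a minimal eager-mode sketch of one training step (my own illustration, assuming my_model and loss_function already exist), which is conceptually what model.fit, and the tutorial's train_step, do on every batch:

opt = tf.keras.optimizers.SGD()

def one_step(x, y):
  # record the forward pass so gradients can be computed
  with tf.GradientTape() as tape:
    pred = my_model(x)             # forward pass
    loss = loss_function(y, pred)  # compare targets to predictions
  # backward pass: gradients of the loss w.r.t. the trainable weights
  grads = tape.gradient(loss, my_model.trainable_variables)
  opt.apply_gradients(zip(grads, my_model.trainable_variables))
  return loss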

Q and A

  • Is the fact that train_step has the @tf.function decorator (and that the loss function is called inside it) what makes this code run as you describe, rather than as normal Python?

No. It is not 'normal' Python. It only defines the flow of tensors through the graph of matrix operations that will (hopefully) run on your GPU. All the TensorFlow operations just set up those operations on the GPU (or on the CPU if you don't have one).
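
A practical consequence (a side note, not part of the original answer): if you want output on every executed step inside train_step, swap the Python print calls for tf.print, which becomes a graph op and runs at execution time:

# inside the decoder loop of train_step, replacing the Python prints
tf.print(tf.shape(targ[:, t]))   # runtime shape, printed on every step
tf.print(tf.shape(predictions))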

  • How can I tell the actual shapes being passed into loss_function (the second part of my question)?

No problem at all... simply run this code

loss_function(y, y).shape

This will compute the loss function with your expected output compared against exactly the same output. The loss will (hopefully) be zero, but actually calculating the value of the loss wasn't the point: you want the shape, and this will give it to you.
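
Alternatively, a minimal sketch (my own, using dummy tensors shaped like the values reported in the question) that confirms the per-time-step shapes in plain eager mode:

# dummy tensors matching the shapes the question reports for one time step
real = tf.zeros((64,), dtype=tf.int32)         # (batch,) -- targets for one step
pred = tf.zeros((64, 4935), dtype=tf.float32)  # (batch, vocab) -- logits for one step

print(loss_object(real, pred).shape)    # (64,): per-example loss, since reduction='none'
print(loss_function(real, pred).shape)  # (): scalar after tf.reduce_mean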