All Gradient values calculated as "None" if using BCE loss manually

Question

I am working on a multi-output model where I need to weight each output loss before calculating the overall loss. I have customized the model.fit() training loop to achieve this.

Because I need to calculate the sample-wise loss for all four outputs and combine these losses after applying the weights, I have customized the standard code. The loss is now calculated sample-wise, but when I compute the gradients, every gradient value is "None". I also tried adding tape.watch(loss), but that does not work either. Please help me fix this issue.

class CustomModel(keras.Model):
    def train_step(self, data):
        print(tf.executing_eagerly())
        # Unpack the data. Its structure depends on your model and
        # on what you pass to `fit()`.
        x, y = data
        alpha = 0.1
        loss = 0
        y_pred_all = []

        with tf.GradientTape() as tape:
            bce = tf.keras.losses.BinaryCrossentropy(reduction=tf.keras.losses.Reduction.NONE)
            for spl in range(1 if np.shape(x)[0] == None else np.shape(x)[0]):
                tape.watch(loss)
                tape.watch(loss_mean)
                tape.watch(loss_element)
                x_spl = np.reshape(x[spl], (1, np.shape(x)[1], np.shape(x)[2], np.shape(x)[3]))
                y_pred = self(x_spl, training=True)  # Forward pass
                y_pred_all.append(y_pred)
                loss_element = bce(y[spl], y_pred)
                loss_mean = [np.mean(loss_element[0]), np.mean(loss_element[1]), np.mean(loss_element[2]), np.mean(loss_element[3])]
                id = np.argmin(loss_mean)
                for i, ele in enumerate(loss_mean):
                    if i == id:
                        loss_mean[i] *= 1
                    else:
                        loss_mean[i] *= alpha

                loss = loss + np.sum(loss_mean)

        # Compute gradients
        trainable_vars = self.trainable_variables

        gradients = tape.gradient(loss, trainable_vars)

        # Update weights
        self.optimizer.apply_gradients(zip(gradients, trainable_vars))

        # Update metrics (includes the metric that tracks the loss)
        self.compiled_metrics.update_state(y, y_pred_all)
        # Return a dict mapping metric names to current value
        return {m.name: m.result() for m in self.metrics}

UPDATE: I made a few changes as suggested by @rvinas. The gradients are now calculated without any error, but I am not sure whether the changes are correct:

class CustomModel(keras.Model):
    def train_step(self, data):
        # print(tf.executing_eagerly())
        # Unpack the data. Its structure depends on your model and
        # on what you pass to `fit()`.
        x, y = data
        alpha = 0.1
        loss = tf.Variable(0, dtype='float32')
        y_pred_all = []

        with tf.GradientTape() as tape:
            bce = tf.keras.losses.BinaryCrossentropy(reduction=tf.keras.losses.Reduction.NONE)
            for spl in tf.range(1 if tf.shape(x)[0] == None else tf.shape(x)[0]):
                loss_mean=tf.convert_to_tensor([])
                x_spl =  tf.reshape(x[spl], (1, tf.shape(x)[1], tf.shape(x)[2], tf.shape(x)[3]))
                y_pred = self(x_spl, training=True)  # Forward pass
                y_pred_all.append(y_pred)
                loss_element = bce(y[spl], y_pred)
                loss_mean = [tf.reduce_mean(loss_element[0]), tf.reduce_mean(loss_element[1]), tf.reduce_mean(loss_element[2]), tf.reduce_mean(loss_element[3])]

                id = tf.argmin(loss_mean)
                for i, ele in enumerate(loss_mean):
                    if i == id:
                        loss_mean[i] = tf.multiply(loss_mean[i], 1)
                    else:
                        loss_mean[i] = tf.multiply(loss_mean[i], alpha)

                loss = tf.add(loss, tf.add(tf.add(tf.add(loss_mean[0],loss_mean[1]), loss_mean[2]), loss_mean[3]))

        # Compute gradients
        trainable_vars = self.trainable_variables

        gradients = tape.gradient(loss, trainable_vars)

        # Update weights
        self.optimizer.apply_gradients(zip(gradients, trainable_vars))

        # Update metrics (includes the metric that tracks the loss)
        self.compiled_metrics.update_state(y, y_pred_all)
        # Return a dict mapping metric names to current value
        return {m.name: m.result() for m in self.metrics}

You should not be using NumPy operations (e.g. np.sum, np.reshape, ...) - this results in a disconnected graph. Instead use TensorFlow operations only. — rvinas
@rvinas can you please suggest what should be done here to fix the issue? I am very new to TF, so I am not familiar with the TF operations. I used NumPy operations here because I need to manipulate/weight each output branch loss. — skiii gairola
At first glance, it is difficult to understand how you're weighting each element of the loss. Ideally, you should have a tensor weights with the same shape as loss (i.e., (batch_size, nb_elements)) and compute the final weighted loss with something along the lines of tf.reduce_mean(weights * loss). The "for loop" within the gradient tape block should ideally be avoided. — rvinas
@rvinas Actually, I am trying to implement a paper, and the paper says to calculate the sample-wise loss for each scale/output. In my case, the number of outputs is 4. Whichever loss (out of the 4 output losses) is smallest for a sample gets a weight of 1, and the remaining three losses get a weight of 0.1. For example, if for a sample output_loss = [2, 5, 1, 3] (the 4 elements are the loss values of the 4 outputs), the weighting logic gives final_loss = (2*0.1) + (5*0.1) + (1*1) + (3*0.1) — skiii gairola
I hope the provided solution helps - note that I am only using TF operations — rvinas

rvinas · Accepted Answer · 2021-04-05T09:54:47

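The None gradients occur because you are using NumPy operations (e.g. np.sum, np.reshape, ...), which disconnect the computation from the gradient tape: the tape only records TensorFlow operations, so it cannot trace the loss back to the trainable variables. Instead, implement the logic using TensorFlow operations only.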
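
To see the effect in isolation, here is a minimal sketch. The variable `w` and the toy squared-sum loss are illustrative stand-ins and are not taken from the question. Passing a value through NumPy turns it into a constant that the tape has no record of, so the gradient comes back as None, while the equivalent TensorFlow op keeps the path to the variable intact:

```python
import numpy as np
import tensorflow as tf

w = tf.Variable([1.0, 2.0, 3.0])  # illustrative stand-in for a trainable weight

# NumPy path: the value leaves TensorFlow, so the result is a fresh
# constant with no recorded connection to `w`.
with tf.GradientTape() as tape:
    loss_np = tf.constant(np.sum((w * w).numpy()))
grad_np = tape.gradient(loss_np, w)

# TensorFlow path: every op is recorded by the tape.
with tf.GradientTape() as tape:
    loss_tf = tf.reduce_sum(w * w)
grad_tf = tape.gradient(loss_tf, w)

print(grad_np)          # None
print(grad_tf.numpy())  # [2. 4. 6.]
```

Calling tape.watch on the NumPy-derived loss does not help, because watching a tensor only tracks what is computed from it afterwards; it does not restore the missing link back to the model weights.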

For example, one could implement the weightings described in the comments section as follows:

bce = tf.keras.losses.BinaryCrossentropy(reduction=tf.keras.losses.Reduction.NONE)
with tf.GradientTape() as tape:
    # Compute element-wise losses
    y_pred = self(x, training=True)
    losses = bce(y, y_pred)  # Shape=(bs, 4)

    # Find the smallest loss for each sample (it gets weight 1, as described in the comments)
    idx_min = tf.argmin(losses, axis=-1)  # Shape=(bs,)
    idx_min_onehot = tf.one_hot(idx_min, depth=losses.shape[-1])  # Shape=(bs, 4)

    # Create weights tensor
    weight_min = 1
    weight_others = 0.1
    weights = idx_min_onehot * weight_min + (1 - idx_min_onehot) * weight_others

    # Aggregate losses
    losses = tf.reduce_sum(weights * losses, axis=-1)
    loss = tf.reduce_mean(losses)
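
As a sanity check, the weighting can be run on the example from the comments, output_loss = [2, 5, 1, 3]. The sketch below is only an illustration: the helper name `weighted_min_loss` is hypothetical, the losses are stored in a tf.Variable purely so that a gradient can be inspected, and the smallest loss per sample is selected with tf.argmin, following the rule stated in the comments (smallest loss gets weight 1, the others 0.1).

```python
import tensorflow as tf

def weighted_min_loss(losses, weight_min=1.0, weight_others=0.1):
    """Hypothetical helper; `losses` has shape (batch_size, n_outputs)."""
    idx_min = tf.argmin(losses, axis=-1)  # Shape=(bs,)
    onehot = tf.one_hot(idx_min, depth=losses.shape[-1], dtype=losses.dtype)
    weights = onehot * weight_min + (1.0 - onehot) * weight_others
    per_sample = tf.reduce_sum(weights * losses, axis=-1)  # Shape=(bs,)
    return tf.reduce_mean(per_sample)

# One sample with the four per-output losses from the comments.
losses = tf.Variable([[2.0, 5.0, 1.0, 3.0]])

with tf.GradientTape() as tape:
    loss = weighted_min_loss(losses)
grads = tape.gradient(loss, losses)

print(float(loss))    # approximately 2.0 = 0.2 + 0.5 + 1.0 + 0.3
print(grads.numpy())  # approximately [[0.1 0.1 1.  0.1]], i.e. the weights
```

The gradient is not None, and it equals the weights, which shows that tf.argmin and tf.one_hot only select where each weight goes while the gradient still flows through the multiplication with the losses.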
