
I've been working on a CNN to localize a coin in an image. The CNN outputs a bounding box for the coin (x_min, y_min, x_max, y_max) and a few probabilities: image_contains_coin, image_doesnt_contain_coin, coin_too_close (if the coin is too close to the camera), and dirty_coin (if the coin is dirty, i.e. not clean/shiny).

So I'm using the results of the last NN layer's matmul directly for the bounding box outputs, and the probabilities go through an extra sigmoid function. If there is no coin in the image, then the error of the bounding box and of the two conditional probabilities (coin_too_close and dirty_coin) should be ignored.

The error is calculated as the mean squared error of the bounding box plus the cross-entropy error of the sigmoid/probability outputs.

Is the following code the correct way to do this kind of thing in TensorFlow? And are there any problems with my ideas or code?

Just to see where the outputs are in the array:

output_x1 = 0
output_y1 = 1
output_x2 = 2
output_y2 = 3
output_is_coin = 4
output_is_not_coin = 5
output_is_too_close = 6
output_is_dirty = 7
num_outputs = 8

And the output layer and cost model:

with tf.variable_scope('output'):
    # Output layer: raw bounding box values plus logits for the probabilities
    out = cnn.output_layer(fc1, 256, num_outputs, 'out')
    out_for_error = out

    # Slice up the outputs and add a sigmoid to the probabilities
    aabb_out = tf.slice(out, [0, 0], [-1, 4])
    prob_out = tf.slice(out, [0, 4], [-1, 4])

    prob_out = tf.nn.sigmoid(prob_out)

    self.out = tf.concat([aabb_out, prob_out], 1, 'O')

with tf.variable_scope('error'):
    # If the image does not contain a coin, then the error for all other outputs
    # needs to be ignored. This is done by replacing those components of the
    # output array with the desired values (from the training data) rather than
    # the output of the NN
    image_is_plate          = tf.slice(image_infos, [0,4], [-1,1])

    is_plate_mask           = tf.constant([1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0])
    not_plate_mask          = tf.constant([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0])

    error_mask              = tf.add(tf.multiply(is_plate_mask, image_is_plate), not_plate_mask)
    inv_error_mask          = tf.subtract(tf.ones([1,1]), error_mask)

    masked_error = tf.multiply(error_mask, out_for_error) + tf.multiply(inv_error_mask, image_infos)

    aabb_error = tf.slice(masked_error, [0,0], [-1,4])
    prob_error = tf.slice(masked_error, [0,4], [-1,4])

    # slice the training targets to run mean squared error on the bounding box
    # and cross entropy on the probabilities
    image_infos_aabbs = tf.slice(image_infos, [0,0], [-1,4])
    image_infos_probs = tf.slice(image_infos, [0,4], [-1,4])

    self.error = tf.add(
        tf.reduce_mean(tf.square(tf.subtract(aabb_error, image_infos_aabbs))),
        tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=prob_error, labels=image_infos_probs)))

with tf.variable_scope('optimizer'):
    # Define loss and optimizer
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
    self.train_op = optimizer.minimize(self.error)

So, how does that look? Am I doing this the correct way? The results seem alright but it feels like something could be improved...


1 Answer


Your approach is reasonable, and adding up the losses of multiple heads is the standard way to obtain a single scalar loss value. The only thing that looks suspicious is the bounding box error when there's no coin. As you pointed out, no matter what the network decides for the bounding box values, it should not be penalized if there's no object at all (no answer is better than any other in this case).

I would suggest two more approaches to deal with it:

  • Introduce one more input: a weight w, equal to 1.0 or 0.0, depending on whether there's an object or not. The weight should multiply the regression loss (see the sketch after this list).

  • Make a special target tuple for the "missing coin" case, something like (-1, -1, -1, -1). This approach is less favorable, because the network will still be pulled towards that corner, even though there's no reason for it.
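
Here is a minimal sketch of the first approach in TensorFlow 1.x, reusing the tensor names from your code (out, image_infos, image_is_plate); the shapes are assumptions based on your slicing, and the helper names (box_loss, prob_loss, total_loss) are just for illustration:

# Minimal sketch of the per-sample weight approach (TF 1.x), reusing the names
# from the question. `image_is_plate` is assumed to have shape [batch, 1] and
# hold 1.0 when a coin is present, 0.0 otherwise.
aabb_out          = tf.slice(out, [0, 0], [-1, 4])           # raw box predictions
prob_logits       = tf.slice(out, [0, 4], [-1, 4])           # raw logits
image_infos_aabbs = tf.slice(image_infos, [0, 0], [-1, 4])   # target boxes
image_infos_probs = tf.slice(image_infos, [0, 4], [-1, 4])   # target probabilities

# Per-sample squared error of the box, multiplied by the presence weight so
# that images without a coin contribute nothing to the regression term.
w          = tf.squeeze(image_is_plate, axis=1)
box_sq_err = tf.reduce_mean(tf.square(aabb_out - image_infos_aabbs), axis=1)
box_loss   = tf.reduce_mean(w * box_sq_err)

prob_loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(logits=prob_logits,
                                            labels=image_infos_probs))

total_loss = box_loss + prob_loss

If you also want to keep ignoring the coin_too_close and dirty_coin errors on empty images, the same per-sample weight can be applied to those columns of the cross-entropy term.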


One more detail: pay attention to the scale of the two losses. It's quite possible that the L2 loss tf.square(tf.subtract(aabb_error, image_infos_aabbs)) will in general be one to two orders of magnitude larger than the cross-entropy loss, so it will totally dominate the sum. This can slow down training significantly. You might want to use a weighted sum to account for that.
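
For the weighting, a rough sketch reusing the tensors from your question; the weight values below are placeholders you would tune, not recommendations:

# Weighted sum of the two loss terms. The weights are hyperparameters to tune
# (e.g. chosen so both terms have roughly the same magnitude early in training).
box_loss_weight  = 1.0
prob_loss_weight = 10.0  # placeholder value, not a recommendation

self.error = (box_loss_weight  * tf.reduce_mean(tf.square(tf.subtract(aabb_error, image_infos_aabbs)))
            + prob_loss_weight * tf.reduce_mean(
                  tf.nn.sigmoid_cross_entropy_with_logits(logits=prob_error,
                                                          labels=image_infos_probs)))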