While training Mask RCNN using TensorFlow Object Detection API, what is the 'loss'?

Question

I am training a custom object detector using Mask RCNN in the TensorFlow Object Detection API, so the model has to predict the object instance mask along with the bounding box.

Pre-trained model : mask_rcnn_inception_v2_coco

Following is a snapshot of my training.

INFO:tensorflow:global step 4181: loss = 0.0031 (3.290 sec/step)

INFO:tensorflow:global step 4181: loss = 0.0031 (3.290 sec/step)

INFO:tensorflow:global step 4182: loss = 0.0030 (2.745 sec/step)

INFO:tensorflow:global step 4182: loss = 0.0030 (2.745 sec/step)

In this case, can you please tell me what is the loss here?

My question is not about the training loss and how it varies with the steps.

I am just unclear about what this loss means when training a Mask RCNN. In a Mask RCNN, there are 3 parallel heads at the last layer:

for detecting the class
for predicting bounding box
for predicting instance masks

In such a case, what is the loss?

Mark.F · Accepted Answer · 2019-01-24T10:11:11

The loss function in the Mask R-CNN paper is a weighted sum of three losses, one per output: classification, localization, and segmentation mask:

$L=w_{cls}\cdot L_{cls}+w_{bbox}\cdot L_{bbox}+w_{mask}\cdot L_{mask},$

The classification and bounding-box (localization) losses are the same as in Faster R-CNN.
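As a rough illustration of the weighted sum above, the three head losses collapse into the single scalar that the training log reports. This is only a sketch: the function name `total_loss`, the default weights of 1.0, and the example values are assumptions made for illustration, not values taken from the API or from the training run in the question.

```python
def total_loss(l_cls, l_bbox, l_mask, w_cls=1.0, w_bbox=1.0, w_mask=1.0):
    """Combine the three head losses into one scalar: L = sum of w_i * L_i."""
    return w_cls * l_cls + w_bbox * l_bbox + w_mask * l_mask


# Hypothetical per-head loss values, chosen only to show the arithmetic.
combined = total_loss(l_cls=0.5, l_bbox=0.3, l_mask=0.2)
print(f"loss = {combined:.4f}")
```

Setting one weight to 0 removes that head's contribution, which can be a quick way to see how much each term contributes to the total.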

What Mask R-CNN adds is the mask loss. The mask branch generates a mask for each class, without competition among classes (so if you have 10 classes, the mask branch predicts 10 masks), and the loss applies a per-pixel sigmoid followed by a binary cross-entropy loss.
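The per-pixel sigmoid plus binary loss can be sketched in plain Python as follows. This is a minimal sketch, not the API's implementation: the function names and the nested-list layout (`logits_per_class[k][row][col]`) are assumptions for illustration. It also assumes, following the paper's description, that only the mask predicted for the ground-truth class contributes to the loss, and it uses the usual numerically stable form of sigmoid cross-entropy.

```python
import math


def pixel_bce_with_logit(z, t):
    """Binary cross-entropy of sigmoid(z) against target t in {0, 1}.

    Stable form: max(z, 0) - z * t + log(1 + exp(-|z|)).
    """
    return max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))


def mask_loss(logits_per_class, target_mask, gt_class):
    """Average per-pixel binary loss on the ground-truth class's mask only."""
    logits = logits_per_class[gt_class]  # masks of other classes are ignored
    total, count = 0.0, 0
    for logit_row, target_row in zip(logits, target_mask):
        for z, t in zip(logit_row, target_row):
            total += pixel_bce_with_logit(z, t)
            count += 1
    return total / count


# Two classes, 2x2 masks; all values are made up for illustration.
logits = [
    [[4.0, -4.0], [-4.0, 4.0]],  # class 0 mask logits
    [[0.0, 0.0], [0.0, 0.0]],    # class 1 mask logits (unused when gt_class=0)
]
target = [[1, 0], [0, 1]]
print(mask_loss(logits, target, gt_class=0))
```

Because each class has its own sigmoid output, changing the logits of class 1 leaves the loss for a class 0 object unchanged, which is what "without competition among classes" means in practice.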

If you want to dig a little deeper into the mask loss, the paper states: "Multinomial vs. Independent Masks: Mask R-CNN decouples mask and class prediction: as the existing box branch predicts the class label, we generate a mask for each class without competition among classes (by a per-pixel sigmoid and a binary loss). In Table 2b, we compare this to using a per-pixel softmax and a multinomial loss (as commonly used in FCN [30])."

You can see this comparison on page 6 of the paper, in Table 2b ("Multinomial vs. Independent Masks").
