Mask R-CNN Instance Segmentation vs. Object Detection

Question

I have an object detection problem where my data consists of images with bounding boxes. I've been reviewing several state-of-the-art object detection networks (https://paperswithcode.com/task/object-detection) and am having trouble seeing where instance segmentation ends and object detection begins.

I'm trying to figure out what will perform best when trained on bounding-box annotated data. Would something like Mask R-CNN perform better than Faster R-CNN, or would this performance boost require that all of my data be segmented at the pixel level instead of annotated with bounding boxes before fine-tuning? Would Mask R-CNN outperform Faster R-CNN if trained on bounding boxes and no segmented data? I know you can do bounding box inference with Mask R-CNN, but can you train the model without pixel-level segmentation? What is the state-of-the-art for object detection that doesn't require training with pixel-level segmentation?

Hadi GhahremanNezhad · Accepted Answer · 2019-10-04T14:10:56

Would something like Mask R-CNN perform better than Faster R-CNN, or would this performance boost require that all of my data be segmented at the pixel level instead of annotated with bounding boxes before fine-tuning?

Yes, Mask R-CNN does need your data to be segmented at the pixel level. It is an instance segmentation model, which is one level higher and more complex than an object detection model: besides a box and a class, it predicts a pixel mask for each detected object.
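To make the annotation gap concrete, here is a minimal sketch in plain Python. It assumes a hypothetical mask format (a list of rows of 0/1 pixels) and a hypothetical helper `mask_to_box`; neither comes from any particular library. It shows that a pixel mask always determines a bounding box, while two different masks can share the same box, so box annotations alone cannot supervise a mask branch.

```python
from typing import List, Tuple

# Hypothetical mask format for illustration: rows of 0/1 pixels.
Mask = List[List[int]]


def mask_to_box(mask: Mask) -> Tuple[int, int, int, int]:
    """Return the tight box (x_min, y_min, x_max, y_max) around nonzero pixels."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, value in enumerate(row) if value]
    if not ys:
        raise ValueError("mask contains no foreground pixels")
    return min(xs), min(ys), max(xs), max(ys)


# Two different object shapes (made-up toy data).
mask_a = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
mask_b = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

box_a = mask_to_box(mask_a)
box_b = mask_to_box(mask_b)

# Different masks, identical boxes: the box loses the shape information.
print(box_a, box_b, mask_a != mask_b)
```

Going from mask to box is a simple min/max over pixel coordinates; going from box to mask is not possible without extra information, which is why pixel-level labels are needed for segmentation.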

Would Mask R-CNN outperform Faster R-CNN if trained on bounding boxes and no segmented data?

No. Mask R-CNN is built on the Faster R-CNN object detector, with a segmentation (mask) branch added to it. So if your data is annotated only with bounding boxes, Faster R-CNN is sufficient and there is no point in using Mask R-CNN.

I know you can do bounding box inference with Mask R-CNN, but can you train the model without pixel-level segmentation?

Yes, you can probably train the model that way, but the mask branch would get no supervision, so the performance will not be good and you should not expect any gain over Faster R-CNN. There is also no point in doing so, since Mask R-CNN is slightly slower than Faster R-CNN.
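One way to see this: the Mask R-CNN paper defines the per-RoI training loss as the sum of a classification term, a box regression term, and a mask term. The sketch below is only an illustration of that sum, with a hypothetical helper `roi_loss` and made-up loss values; it ignores the region proposal network losses and is not how any specific library implements training.

```python
from typing import Optional


def roi_loss(loss_cls: float, loss_box: float,
             loss_mask: Optional[float] = None) -> float:
    """Sum the per-RoI loss terms.

    loss_mask is None when no pixel-level annotation is available,
    in which case only the detection terms remain.
    """
    total = loss_cls + loss_box
    if loss_mask is not None:
        total += loss_mask
    return total


# Made-up loss values, for illustration only.
with_masks = roi_loss(0.40, 0.25, 0.30)   # box + mask annotations
boxes_only = roi_loss(0.40, 0.25)         # box annotations only

print(with_masks, boxes_only)
```

With box-only data the mask term simply drops out, leaving the same classification and box terms a Faster R-CNN style detector is trained on.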

What is the state-of-the-art for object detection that doesn't require training with pixel-level segmentation?

Choosing between object detection and segmentation depends on the application and your purpose. If you are dealing with medical images, for example, and trying to detect a tumor, then you need segmentation. But when detecting a car on the street, for instance, you might not care about the exact boundaries of the car and only want to know its location in the image. For this type of application, object detection should suffice. For state-of-the-art object detection that is also real-time, I would suggest YOLO, since it is very fast and performs as well as Faster R-CNN, if not better.
