1
votes

I have an object detection problem where my data consists of images with bounding boxes. I've been reviewing several state-of-the-art object detection networks (https://paperswithcode.com/task/object-detection) and am having trouble seeing where object detection ends and instance segmentation begins.

I'm trying to figure out what will perform best when trained with bounding-box annotated data. Would something like Mask R-CNN perform better than Faster R-CNN, or would this performance boost require that all of my data be segmented at the pixel level instead of annotated with bounding boxes before fine-tuning? Would Mask R-CNN outperform Faster R-CNN if trained on bounding boxes and no segmented data? I know you can do bounding box inference with Mask R-CNN, but can you train the model without pixel-level segmentation? What is the state of the art for object detection that doesn't require training with pixel-level segmentation?

2 Answers

1
votes

Would something like mask R-CNN perform better than faster R-CNN, or would this performance boost require that all of my data be segmented at the pixel level instead of annotated with bounding boxes before fine-tuning?

Yes, the performance boost requires pixel-level annotations. Mask R-CNN is an instance segmentation model: its mask branch is trained against a pixel mask for each object, which makes it one level higher and more complex than an object detection model.
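To make the difference between the two annotation levels concrete, here is a minimal pure-Python sketch (the helper names `mask_to_box` and `box_to_mask` are hypothetical, not part of any library). It shows that a bounding box can always be derived from a pixel mask, while a box alone cannot recover the object's shape, which is why box-only data cannot supervise the mask branch.

```python
from typing import List, Optional, Tuple

# (x_min, y_min, x_max, y_max) as inclusive pixel indices; an assumed convention
Box = Tuple[int, int, int, int]


def mask_to_box(mask: List[List[int]]) -> Optional[Box]:
    """Return the tightest box around the nonzero pixels, or None for an empty mask."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for row in mask for x, value in enumerate(row) if value]
    if not rows:
        return None
    return (min(cols), min(rows), max(cols), max(rows))


def box_to_mask(box: Box, height: int, width: int) -> List[List[int]]:
    """Fill the whole box with ones: the best a box alone can offer as a 'mask'."""
    x0, y0, x1, y1 = box
    return [
        [1 if x0 <= x <= x1 and y0 <= y <= y1 else 0 for x in range(width)]
        for y in range(height)
    ]


# A small diamond-shaped object on a 4x5 image (toy data for illustration)
mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
]

box = mask_to_box(mask)              # mask -> box loses no box information
filled = box_to_mask(box, 4, 5)      # box -> mask cannot restore the diamond

object_pixels = sum(map(sum, mask))
box_pixels = sum(map(sum, filled))
print(box, object_pixels, box_pixels)
```

Going from mask to box is lossless for detection purposes, so segmented data can always be used to train a detector; the reverse direction overestimates the object (here the filled box covers more pixels than the diamond).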

Would mask R-CNN outperform faster R-CNN if trained on bounding boxes and no segmented data?

No. Mask R-CNN is built on the Faster R-CNN object detector, with a segmentation branch added to it. So if your data is annotated only with bounding boxes, Faster R-CNN is sufficient and there is no point in using Mask R-CNN.

I know you can do bounding box inference with mask R-CNN, but can you train the model without pixel level segmentation?

Yes, you can probably train the model that way, but the performance will not be good. There is also little point in doing so, since Mask R-CNN is slightly slower than Faster R-CNN.

What is the state-of-the-art for object detection that doesn't require training with pixel-level segmentation?

Choosing between object detection and segmentation depends on your application and purpose. If you are dealing with medical images, for example, and trying to detect a tumor, then you need segmentation. For detecting a car on the street, on the other hand, you might not care about the exact boundaries of the car and just want to know where it is in the image. For this type of application, object detection should suffice. For state-of-the-art object detection that also runs in real time, I would suggest YOLO, since it is very fast and performs as well as Faster R-CNN, if not better.

1
votes

Just to add more context: in the work by Rohit Malhotra et al. [1], the authors used a deep Mask R-CNN model, a deep learning framework for object instance segmentation, to detect the people in each image and count them. Mask R-CNN extends Faster R-CNN [2] by adding a branch that predicts a segmentation mask for each Region of Interest (RoI) generated by Faster R-CNN. The authors evaluated the model in terms of precision and recall over the image sequences; the results are reported in the paper.

This method can be used to collect the reliable and accurate data required for studies on the effect of visitation policies, and of the frequency and timing of medical procedures, on patients’ sleep-wake cycle and, consequently, on their outcomes, e.g. hospital length of stay. Mask R-CNN can also be used for keypoint detection, which can in turn be used to detect the postures of patients in the hospital.
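For readers unfamiliar with the precision and recall metrics mentioned above, here is a minimal sketch of how they are computed from detection counts. The function name and the example counts are hypothetical and are not taken from the paper; how a detection is judged to be a true positive (for instance, by an overlap threshold with the ground truth) is left out.

```python
from typing import Tuple


def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> Tuple[float, float]:
    """Compute (precision, recall) from detection counts.

    precision = TP / (TP + FP): fraction of predicted detections that are correct.
    recall    = TP / (TP + FN): fraction of real objects that were detected.
    A zero denominator yields 0.0 by convention in this sketch.
    """
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Made-up counts: 8 people correctly detected, 2 spurious detections, 4 people missed
precision, recall = precision_recall(true_pos=8, false_pos=2, false_neg=4)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The same counts apply whether the detector is Faster R-CNN or Mask R-CNN used for box or instance output, so the metric itself does not depend on having pixel-level masks.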

[1] Rohit Malhotra, K., Davoudi, A., Siegel, S., Bihorac, A. and Rashidi, P., 2018. Autonomous detection of disruptions in the intensive care unit using deep Mask R-CNN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 1863-1865).

[2] Ren, S., He, K., Girshick, R. and Sun, J., 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99).