I have an object detection problem where my data consists of images annotated with bounding boxes. I've been reviewing several state-of-the-art object detection networks (https://paperswithcode.com/task/object-detection) and am having trouble seeing where instance segmentation ends and object detection begins.
I'm trying to figure out which model will perform best when trained on data annotated only with bounding boxes. Would something like Mask R-CNN perform better than Faster R-CNN, or would that performance boost require all of my data to be segmented at the pixel level, rather than annotated with bounding boxes, before fine-tuning? In other words, would Mask R-CNN still outperform Faster R-CNN if trained on bounding boxes alone, with no segmentation data? I know you can do bounding-box inference with Mask R-CNN, but can you train the model without pixel-level segmentation? What is the state of the art for object detection that doesn't require training with pixel-level segmentation?