
In Fast R-CNN, I understand that you first apply a CNN to the image in order to get a feature map. Then you use the ROIs generated by an external region proposal method (selective search) to get the bounding boxes of potential objects of interest. However, I don't understand how you get the features from the feature map associated with a region of interest.

For example, I apply selective search and get a list of (x, y, width, height) boxes. Then I apply a CNN (InceptionV3) to the image and get a 2048x1 feature vector (from the pool3 layer). How do I get the regions of interest from my feature vector of the image, or am I interpreting this method incorrectly?

Thanks for your help!


1 Answer


When you use a CNN for a classification task, your network has two parts:

  1. Feature generator. The part that takes an image of size WI x HI with CI channels and produces a feature map of size WF x HF with CF channels. The relation between the image size and the feature map size depends on the structure of your network (for example, on the number of pooling layers and their strides). You can also multiply the strides of all layers in this part of the CNN to get a Step value, which we will use later (see the sketch after this list).
  2. Classifier. The part that classifies vectors with WF*HF*CF components into classes.
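
As a rough illustration of the Step idea, here is a minimal sketch (the layer strides below are made up, not taken from InceptionV3 or any particular network): Step is simply the product of the strides of the convolution/pooling layers in the feature-generator part.

```python
# Minimal sketch with assumed strides; Step is the product of all layer strides.
layer_strides = [2, 2, 2, 2, 2]      # hypothetical conv/pool strides

step = 1
for s in layer_strides:
    step *= s                        # Step = 2 * 2 * 2 * 2 * 2 = 32

WI, HI = 224, 224                    # input size the classifier part expects
WF, HF = WI // step, HI // step      # approximate feature-map size (ignoring padding)
print(step, (WF, HF))                # 32 (7, 7)
```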

Now, if you have an image of size W x H with W > WI and H > HI, you can still apply the first part of your network (because this part contains only convolution and pooling layers) and get a feature map of size WFB > WF and HFB > HF. Every window of size WF x HF in this feature map corresponds to a window of size WI x HI on the source image.
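
To make this concrete, here is a small sketch with a toy two-layer feature generator (an assumed architecture, not InceptionV3): because it contains only convolution and pooling layers, it accepts a larger image and simply produces a larger feature map.

```python
import torch
import torch.nn as nn

# Toy feature generator (assumed): only conv/pool layers, total Step = 2 * 2 = 4.
feature_generator = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

small = torch.randn(1, 3, 64, 64)    # the "designed" input size HI x WI
large = torch.randn(1, 3, 128, 96)   # a larger image H x W

print(feature_generator(small).shape)   # torch.Size([1, 8, 16, 16])  -> HF x WF
print(feature_generator(large).shape)   # torch.Size([1, 8, 32, 24])  -> HFB x WFB
```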

The rectangle (0, 0, WF, HF) on the feature map corresponds to the rectangle (0, 0, WI, HI) on the image. The rectangle (1, 0, WF+1, HF) corresponds to the rectangle (Step, 0, WI + Step, HI) on the image, and so on.

Therefore, if you have the coordinates of an ROI on the feature map, you can return to the corresponding ROI on the source image, and, conversely, you can project an ROI given in image coordinates onto the feature map by dividing its coordinates by Step.
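
Putting it together for the original question, here is a minimal sketch (with hypothetical helper names) that converts an ROI between image coordinates and feature-map coordinates using the Step value:

```python
# Hypothetical helpers: convert ROIs between image and feature-map coordinates.
def image_roi_to_feature_roi(x, y, w, h, step):
    """Project an ROI given in image pixels onto the feature map."""
    return x // step, y // step, max(1, w // step), max(1, h // step)

def feature_roi_to_image_roi(fx, fy, fw, fh, step):
    """Map an ROI on the feature map back to the source image."""
    return fx * step, fy * step, fw * step, fh * step

step = 32                                   # assumed Step for this sketch
x, y, w, h = 96, 64, 160, 128               # a selective-search box, in pixels
print(image_roi_to_feature_roi(x, y, w, h, step))    # (3, 2, 5, 4)
print(feature_roi_to_image_roi(3, 2, 5, 4, step))    # (96, 64, 160, 128)
```

The crop of the feature map given by that projected box is what ROI pooling then turns into a fixed-size vector for the classifier part.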