In the Fast R-CNN approach, region proposals in the original image are projected onto the final convolutional feature map. In the case of the VGG net, the input image is of size 224 x 224 and the final convolutional feature map is 14 x 14 x 512.

Does this mean that proposals on the input image are projected onto the feature map for RoI pooling? Is the projection a simple scaling of the bounding box?

1 Answer

This article gives a good description of RoI pooling and of how you get the equivalent RoI bounding box (BB) on the feature map from the BB on the original image:

https://medium.com/datadriveninvestor/review-on-fast-rcnn-202c9eadd23b

Basically, the goal of RoI pooling is to output a fixed-size feature map from an arbitrary-size section of the CNN output feature map.

To do this, you first have to do RoI projection, which translates the RoI BB (x, y, h, w) from the original image into the RoI BB you need on the feature map. This is done by scaling every value of the box by the sub-sampling ratio.

Example:

  • If your image is 18x18 and your feature map is 3x3, then your sub-sampling ratio is 3/18 = 1/6.
  • To get your projected RoI BB, you multiply each of your original BB values by that ratio, e.g. x' = (3/18)x.
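The projection step can be sketched in a few lines of Python. This is only an illustration of the scaling described above, not the code from the paper or from any library: the function name `project_roi` and the `(height, width)` argument layout are made up for this example, and the box is assumed to be given as (x, y, h, w) in pixels.

```python
def project_roi(box, image_size, feature_size):
    """Scale an (x, y, h, w) box from image coordinates to feature-map coordinates.

    Hypothetical helper for illustration; image_size and feature_size
    are (height, width) pairs.
    """
    x, y, h, w = box
    img_h, img_w = image_size
    feat_h, feat_w = feature_size
    # Sub-sampling ratio along each axis, e.g. 3 / 18 in the example above.
    ratio_y = feat_h / img_h
    ratio_x = feat_w / img_w
    # Horizontal quantities use the x ratio, vertical ones the y ratio.
    return (x * ratio_x, y * ratio_y, h * ratio_y, w * ratio_x)


# An 18x18 image with a 3x3 feature map: every value is scaled by 3/18.
projected = project_roi((6, 6, 12, 6), (18, 18), (3, 3))
print(projected)
```

The result is generally fractional. As far as I know, Fast R-CNN-style implementations then round the projected coordinates to whole feature-map cells before pooling.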

Then you just do the pooling on that section of the feature map, using an H×W grid of pooling windows, each of size roughly h'/H × w'/W, where h' and w' are the height and width of the projected RoI and H and W are the height and width of your target output for the pooling layer.
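Here is a rough pure-Python sketch of that pooling step for a single channel. It assumes the projected RoI has already been rounded to whole feature-map cells, and the way the window edges are chosen (floor for the start, ceil for the end) is one common convention rather than the only one. The function name `roi_max_pool` is made up for this example.

```python
import math


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool one RoI of a 2D feature map into an out_h x out_w grid.

    Hypothetical helper for illustration. feature_map is a list of rows;
    roi is (x, y, h, w) in whole feature-map cells.
    """
    x, y, h, w = roi
    pooled = []
    for i in range(out_h):
        # Rows covered by the i-th window, roughly h / out_h cells tall.
        r0 = y + math.floor(i * h / out_h)
        r1 = y + math.ceil((i + 1) * h / out_h)
        row = []
        for j in range(out_w):
            # Columns covered by the j-th window, roughly w / out_w cells wide.
            c0 = x + math.floor(j * w / out_w)
            c1 = x + math.ceil((j + 1) * w / out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


fmap = [[0, 1, 2, 3],
        [4, 5, 6, 7],
        [8, 9, 10, 11],
        [12, 13, 14, 15]]
print(roi_max_pool(fmap, (0, 0, 4, 4), 2, 2))  # [[5, 7], [13, 15]]
```

Whatever the size of the RoI, the output is always out_h x out_w, which is what lets the fully connected layers after it take a fixed-size input.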

The article gives a much better description and I encourage you to check it out and the original paper!