I got the image below from the blog post here, which was very informative.
RCNN
In RCNN, I get that selective search is used to select regions of interest ("proposals"), and these are passed into a ConvNet, which produces a feature vector of 4096 dimensions (seemingly arbitrary). This gets passed to an SVM, and we get a classification. Makes sense.
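To make the RCNN flow concrete, here is a toy sketch of my understanding. The "network" here is just a stand-in random projection (the real one is AlexNet, whose fc7 layer is what has 4096 units); the proposal boxes are made-up numbers, not real selective-search output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "network" weights, shared across all proposals (the real ConvNet
# is AlexNet; this random projection only exists to make the shapes concrete).
W = rng.standard_normal((4096, 8 * 8 * 3)) * 0.01

def fake_convnet(warped):
    # Stand-in for the conv + fc stack: crudely subsample the 227x227 warped
    # crop to 8x8, then project to a 4096-dim vector, mirroring fc7's size.
    pooled = warped[::32, ::32, :]
    return W @ pooled.reshape(-1)

image = rng.random((480, 640, 3))
# Hypothetical (y, x, h, w) proposals standing in for selective-search output.
proposals = [(50, 60, 130, 170), (200, 220, 100, 150)]

features = []
for (y, x, h, w) in proposals:
    crop = image[y:y + h, x:x + w]
    # Warp each proposal to the fixed 227x227 input the network expects
    # (nearest-neighbour resize via index selection).
    warped = crop[
        np.linspace(0, h - 1, 227).astype(int)[:, None],
        np.linspace(0, w - 1, 227).astype(int)[None, :],
    ]
    features.append(fake_convnet(warped))  # one 4096-dim vector per proposal

features = np.stack(features)
print(features.shape)  # (2, 4096): one feature vector per proposal, fed to SVMs
```

The key point (as I understand it) being that the ConvNet runs once *per proposal*, which is why RCNN is slow.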
Fast-RCNN
"instead of feeding the region proposals to the CNN, we feed the input image to the CNN to generate a convolutional feature map.From the convolutional feature map, we identify the region of proposals and warp them into squares and by using a RoI pooling layer we reshape them into a fixed size so that it can be fed into a fully connected layer."
I know all of these words separately, but putting them together like this has confused me. For Fast-RCNN, the distinction seems to be that a ConvNet is used to generate the regions of interest, as opposed to selective search. How does this work?
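Here is the step from the quote sketched in code, as I currently read it: the CNN runs once on the whole image, and the proposal boxes (still in image pixel coordinates) get mapped onto the resulting feature map and RoI-pooled to a fixed size. All names, shapes, and the `spatial_scale` are my own stand-ins, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

feature_map = rng.random((64, 60, 80))   # C x H x W output of the conv layers
spatial_scale = 1 / 8                    # image coords -> feature-map coords

def roi_pool(fmap, box, output_size=7):
    # box = (x1, y1, x2, y2) in image pixels, e.g. from selective search.
    # Project the box onto the feature map, then max-pool a 7x7 grid of
    # cells so every RoI comes out the same fixed size.
    x1, y1, x2, y2 = [int(round(c * spatial_scale)) for c in box]
    region = fmap[:, y1:y2 + 1, x1:x2 + 1]
    C, h, w = region.shape
    out = np.empty((C, output_size, output_size))
    ys = np.linspace(0, h, output_size + 1).astype(int)
    xs = np.linspace(0, w, output_size + 1).astype(int)
    for i in range(output_size):
        for j in range(output_size):
            cell = region[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                             xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[:, i, j] = cell.max(axis=(1, 2))  # max over each grid cell
    return out

pooled = roi_pool(feature_map, (40, 32, 200, 160))
print(pooled.shape)  # (64, 7, 7) regardless of the box's size
```

If this sketch is right, the convolutions run once per image instead of once per proposal, and only the cheap pooling runs per proposal.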
My current understanding is confused at steps 2/3; otherwise I think I am good:
- We have an image and feed it to a CNN.
- The CNN learns filters as usual, by randomly initializing them (and subsequently adjusting based on the error).
- Selective search is somehow used on the stack of convolved images (the feature map)?
- RoIs pooled to one size.
- A softmax layer decides the classification + a linear regressor gets the bounding box.
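The last two steps of my list, sketched: the fixed-size pooled RoI goes through fully connected layers, then splits into two sibling heads, a softmax over K classes plus background, and a linear bounding-box regressor. All weights here are random stand-ins and K is a made-up example value.

```python
import numpy as np

rng = np.random.default_rng(2)
K = 20                                   # hypothetical number of object classes

pooled = rng.random((64, 7, 7))          # one RoI after RoI pooling
fc = rng.standard_normal((4096, pooled.size)) * 0.01
h = np.maximum(fc @ pooled.reshape(-1), 0)   # fc layer + ReLU -> 4096-dim vector

W_cls = rng.standard_normal((K + 1, 4096)) * 0.01  # K classes + background
W_box = rng.standard_normal((4 * K, 4096)) * 0.01  # 4 box offsets per class

logits = W_cls @ h
probs = np.exp(logits - logits.max())
probs /= probs.sum()                     # softmax over K + 1 classes

box_deltas = W_box @ h                   # per-class bounding-box adjustments

print(probs.shape, box_deltas.shape)     # (21,) (80,)
```

So (if I have this right) classification and box regression are trained jointly from the same pooled feature, rather than RCNN's separate SVM stage.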
Bonus: Why is the feature vector 4096 dimensions in RCNN? Is that just an arbitrarily chosen number?