0 votes

I got the image below from the blog post here, which was very informative.

[Image: how the convolutional neural net is used in Fast-RCNN]

RCNN

In RCNN, I get that selective search is used to select Regions of Interest ("proposals"), and these are passed into a ConvNet, which produces a feature vector of 4096 (seemingly arbitrary) dimensions. This gets passed to an SVM, and we get a classification. Makes sense.
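To check my understanding, here is that pipeline as a runnable sketch (all three functions are stubs I made up, not real APIs):

```python
import numpy as np

def selective_search(image):
    # Stub: returns ~2000 region proposals as (x1, y1, x2, y2) boxes.
    return [(0, 0, 50, 50), (10, 10, 120, 90)]

def cnn_forward(warped_crop):
    # Stub for the AlexNet-style ConvNet: a warped 227x227 crop goes in,
    # the 4096-d output of the penultimate fully connected layer comes out.
    return np.random.rand(4096)

def svm_classify(feature):
    # Stub: per-class SVMs score the 4096-d feature vector.
    return "cat", 0.87

image = np.zeros((480, 640, 3))
for (x1, y1, x2, y2) in selective_search(image):
    crop = image[y1:y2, x1:x2]            # crop the proposal from the image
    feature = cnn_forward(crop)           # one full ConvNet pass PER proposal
    label, score = svm_classify(feature)  # classify the feature vector
```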

Fast-RCNN

"instead of feeding the region proposals to the CNN, we feed the input image to the CNN to generate a convolutional feature map.From the convolutional feature map, we identify the region of proposals and warp them into squares and by using a RoI pooling layer we reshape them into a fixed size so that it can be fed into a fully connected layer."

I know all of these words separately, but putting them together like this has confused me. For Fast-RCNN, the distinction appears to be that a ConvNet is used to generate the Regions of Interest, as opposed to selective search. How does this work?
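Here is the quote rewritten as pseudocode for what I think the Fast-RCNN forward pass looks like; the function names are mine, and `get_proposals` is exactly the part I am unsure about:

```python
import numpy as np

def backbone_cnn(image):
    # The ConvNet runs ONCE on the whole image and produces a single
    # shared convolutional feature map: (channels, H/16, W/16) here.
    return np.random.rand(512, 30, 40)

def get_proposals(image, feature_map):
    # The part I am unsure about: does this still come from selective
    # search on the image, or from something run on the feature map?
    return [(0, 0, 100, 80), (50, 40, 200, 160)]

def roi_pool(feature_map, box, out_size=(7, 7)):
    # Stub: crop the box's region out of the feature map and pool it
    # to a fixed grid (a concrete toy version is sketched further down).
    return np.random.rand(feature_map.shape[0], *out_size)

image = np.zeros((480, 640, 3))
feature_map = backbone_cnn(image)          # one ConvNet pass per IMAGE
for box in get_proposals(image, feature_map):
    roi = roi_pool(feature_map, box)       # fixed size per proposal
    flat = roi.reshape(-1)                 # fed to fully connected layers
    # -> softmax for the class + a regressor for the bounding box
```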

My current understanding is confused at steps 2/3; otherwise I think I am good:

  1. We have an image and feed it to a CNN.
  2. The CNN learns its filters as usual, starting from random initialization (and subsequently adjusting them based on error).
  3. Selective search is used on the stack of convolved images?
  4. RoIs are pooled to one fixed size (toy sketch after this list).
  5. A softmax layer decides the classification + a linear regressor gives the bounding box.
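To make step 4 concrete, here is a toy NumPy version of what I understand RoI pooling to do; it is my own simplification (real implementations handle coordinate scaling and edge cases differently):

```python
import numpy as np

def roi_max_pool(feature_map, box, out_h=2, out_w=2):
    # Crop `box` from `feature_map` (C, H, W) and max-pool the crop
    # into a fixed (C, out_h, out_w) grid, whatever the crop's size.
    x1, y1, x2, y2 = box                   # box in feature-map coordinates
    region = feature_map[:, y1:y2, x1:x2]  # variable-size crop
    c, h, w = region.shape
    out = np.zeros((c, out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # split the crop into out_h x out_w roughly equal bins
            ys, ye = h * i // out_h, max(h * (i + 1) // out_h, h * i // out_h + 1)
            xs, xe = w * j // out_w, max(w * (j + 1) // out_w, w * j // out_w + 1)
            out[:, i, j] = region[:, ys:ye, xs:xe].max(axis=(1, 2))
    return out

fmap = np.random.rand(3, 10, 12)                # pretend feature map
print(roi_max_pool(fmap, (2, 1, 9, 8)).shape)   # (3, 2, 2): fixed output size
```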

Bonus: Why is the feature vector 4096-dimensional in RCNN? Is it just a randomly selected number?

I have the same doubts. Have you clarified this? How are the region proposals generated? - xcsob

1 Answer

0 votes

I just read an article by Ross Girshick, 'Region-based Convolutional Networks for Accurate Object Detection and Segmentation'. He says, in part:

"We extract a fixed-length feature vector from each region proposal using a CNN. The particular CNN architecture used is a system hyperparameter. Most of our experiments use the Caffe [55] implementation of the CNN described by Krizhevsky et al. [8]; however, we have also experimented with the 16-layer deep network from Simonyan and Zisserman [24] (OxfordNet). In both cases, the feature vectors are 4096-dimensional. Features are computed by forward propagating a mean-subtracted S × S RGB image through the network and reading off the values output by the penultimate layer (the layer just before the softmax classifier). For TorontoNet, S = 227 and for OxfordNet, S = 224. We refer readers to [8], [24], [55] for more network architecture details."

That means 4096 is not an arbitrary choice: it is simply the width of the penultimate fully connected layer, and both architectures he tried happen to produce feature vectors of that same size.
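You can check this yourself: with torchvision's AlexNet, for example (a close relative of the Caffe TorontoNet he mentions, expecting 224x224 inputs rather than 227x227), reading the values just before the final classifier layer gives exactly a 4096-dimensional vector:

```python
import torch
import torchvision.models as models

net = models.alexnet(weights=None)   # AlexNet-style network

# Drop the last Linear layer so the forward pass stops at the
# penultimate fully connected layer.
net.classifier = net.classifier[:-1]
net.eval()

x = torch.randn(1, 3, 224, 224)      # one mean-subtracted RGB image
with torch.no_grad():
    feat = net(x)
print(feat.shape)                    # torch.Size([1, 4096])
```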