Recently, I've been playing with the MATLAB's RCNN deep learning example here. In this example, MATLAB has designed a basic 15 layer CNN with the input size of 32x32. They use the CIFAR10 dataset to pre-train this CNN. The CIFAR10 dataset has training images of size 32x32 too. Later they use a small dataset of stop signs to fine tune this CNN to detect stop signs. This small dataset of stop signs has only 41 images; so they use these 41 images to fine tune the CNN and namely train an RCNN network. this is how they detect a stop sign: As you see the bounding box almost covers the whole stop signs except a small part on the top. Playing with the code I decided to fine tune the same network pre-trained on the CIFAR10 dataset with the PASCAL VOC dataset but only for the "aeroplane" class. These are some results I get:
As you see the detected bounding boxes barely cover the whole airplane; so this causes the precision to be 0 later when I evaluate them. I understand that in the original RCNN paper mentioned in the MATLAB example the input size 227x227 and their CNN has 25 layers. Could this be why the detections are not accurate? How does the input size of a CNN affect the end result?