I am trying the TensorFlow faster_rcnn_resnet101 model to detect multiple objects in a 642*481 image. I tried two approaches, but neither gave satisfactory results.

1) First, I cropped the objects (50*160 rectangles) from the training images (the training images may have different dimensions than the 642*481 test image) and used those crops to train faster_rcnn_resnet101. The results look good when the test set also consists of crops of the same size, but on the full 642*481 test image the model could not detect multiple objects well.
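The cropping step above can be sketched with Pillow; the image, coordinates, and patch size here are hypothetical placeholders mirroring the 50*160 rectangles in the question:

```python
# A minimal sketch of approach 1: cutting fixed-size object patches out of
# a training image with Pillow. The image and box coordinates are stand-ins.
from PIL import Image

def crop_object(image, left, top, width=50, height=160):
    """Cut a width*height patch whose top-left corner is (left, top)."""
    # PIL's crop box is (left, upper, right, lower).
    return image.crop((left, top, left + width, top + height))

# Example: extract one 50*160 patch from a larger training image.
training_image = Image.new("RGB", (642, 481), "gray")  # stand-in for a real photo
patch = crop_object(training_image, left=100, top=50)
print(patch.size)  # (50, 160)
```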

I then suspected that the model rescales the test image to match the 50*160 crops, so the details get lost. With that in mind, I tried another approach:

2) I pasted each cropped image onto a 642*481 white background (basically padding), so each training image has the same dimensions as the test image. The position of the crop on the background is deliberately random. However, detection on the test image is still poor. To probe the model's behavior, I used GIMP to keep the windows that contain objects and replace everything outside those windows with white pixels: the results are much better. If I keep only one object window and whiten everything else, the results are excellent.
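The padding step above can be sketched as follows; the canvas size mirrors the question, while the crop and seed are synthetic stand-ins:

```python
# A sketch of approach 2: pasting a crop onto a 642*481 white canvas at a
# random position. The crop here is a synthetic placeholder.
import random
from PIL import Image

def pad_onto_canvas(crop, canvas_size=(642, 481), seed=None):
    """Place `crop` at a random location on a white background."""
    rng = random.Random(seed)
    canvas = Image.new("RGB", canvas_size, "white")
    max_x = canvas_size[0] - crop.width
    max_y = canvas_size[1] - crop.height
    x, y = rng.randint(0, max_x), rng.randint(0, max_y)
    canvas.paste(crop, (x, y))
    return canvas, (x, y)  # (x, y) becomes the bounding-box origin for labeling

crop = Image.new("RGB", (50, 160), "black")
padded, origin = pad_onto_canvas(crop, seed=0)
print(padded.size)  # (642, 481)
```

Keeping the random `(x, y)` origin matters: it is exactly the bounding-box information the training labels need.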

So my question is: what is happening behind the scenes, and how can I make the model detect multiple objects in the test images? Thanks.


1 Answer


The idea is that you should train on images that contain multiple object instances. In other words, you don't have to crop! Especially since your images are not very big.

However, you DO have to label the objects, i.e. record the bounding box of each object in an image. Furthermore, this data should be assembled, together with the original image, into a TFRecord file. There is a good guide here covering the whole process. In your case, you would label all objects in the source image (642x481) instead of just the "raccoon". If your objects belong to multiple classes, make sure you label them accordingly!
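Assembling one labeled image into a TFRecord can be sketched as below. The feature keys follow the TensorFlow Object Detection API's expected format (as used in tutorials like the raccoon one), with box coordinates normalized to [0, 1]; the image bytes, boxes, and class names here are hypothetical placeholders:

```python
# A sketch of writing one labeled image, with multiple bounding boxes, to a
# TFRecord in the format the TF Object Detection API expects. All concrete
# values (image bytes, boxes, class names) are placeholders.
import tensorflow as tf

def bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def float_list_feature(values):
    return tf.train.Feature(float_list=tf.train.FloatList(value=values))

def bytes_list_feature(values):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))

def int64_list_feature(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

def make_example(encoded_jpeg, width, height, boxes, class_names, class_ids):
    """boxes: list of (xmin, ymin, xmax, ymax) in pixels, one per object."""
    xmins = [b[0] / width for b in boxes]   # normalize to [0, 1]
    ymins = [b[1] / height for b in boxes]
    xmaxs = [b[2] / width for b in boxes]
    ymaxs = [b[3] / height for b in boxes]
    return tf.train.Example(features=tf.train.Features(feature={
        'image/height': int64_feature(height),
        'image/width': int64_feature(width),
        'image/encoded': bytes_feature(encoded_jpeg),
        'image/format': bytes_feature(b'jpeg'),
        'image/object/bbox/xmin': float_list_feature(xmins),
        'image/object/bbox/ymin': float_list_feature(ymins),
        'image/object/bbox/xmax': float_list_feature(xmaxs),
        'image/object/bbox/ymax': float_list_feature(ymaxs),
        'image/object/class/text': bytes_list_feature(class_names),
        'image/object/class/label': int64_list_feature(class_ids),
    }))

# Two labeled objects in one 642*481 image (placeholder JPEG bytes).
example = make_example(b'\xff\xd8fake-jpeg-bytes', 642, 481,
                       boxes=[(100, 50, 150, 210), (300, 200, 350, 360)],
                       class_names=[b'object', b'object'], class_ids=[1, 1])
with tf.io.TFRecordWriter('train.record') as writer:
    writer.write(example.SerializeToString())
```

The key point for the question: each `Example` carries *all* boxes for the image, so the network sees objects in their full-image context during training.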

If you train with these images, which contain the objects in their context, the network will learn to recognize similar images that also have objects in context.