
So I've been messing around with TensorFlow's Object Detection API, specifically the re-training of models, essentially doing this. I got it to detect my object fairly well with a small number of images. I wanted to increase the number of training images, but the labeling process is long and boring, so I found a data set of cropped images, where only my object is in each image.

If there's a way to send whole images to be trained with the TensorFlow API without labeling them, I didn't find it, but I figured writing a program that labels each whole image would not be that hard.

The format of the labeling is a csv file with these entries: filename, width, height, class, xmin, ymin, xmax, ymax.

This is my code:

import os
import cv2

path = "D:/path/to/image/folder"

directory = os.fsencode(path)
with open("D:/result/train.txt", "w") as text:
    for file in os.listdir(directory):
        filename = os.fsdecode(file)
        if filename.endswith(".jpg"):
            impath = path + "/" + filename
            img = cv2.imread(impath)
            if img is None:  # skip files OpenCV cannot read
                continue
            height, width = img.shape[:2]
            # filename, width, height, class, xmin, ymin, xmax, ymax
            res = "%s,%d,%d,person,1,1,%d,%d\n" % (filename, width, height, width - 1, height - 1)
            text.write(res)
            print(res)

This seems to be working fine.

Now here's the problem. After converting the .txt to .csv and training until the loss stops decreasing, my detections on my test set are awful. The model puts a huge bounding box around the entire image, as if it were trained to detect only the edges of the image.

I figure it's somehow learning to detect the edges of the images since the labeling is around the whole image. But how do I make it learn to "see" what's in the picture? Any help would be appreciated.

Why do you set xmin, ymin, xmax, ymax to 1, 1, img.shape[1]-1, img.shape[0]-1? – keineahnung2345
It's two pairs of coordinates indicating that almost the whole image should be covered by a rectangular bounding box. In image coordinates, (1, 1) is the top-left corner and (img.shape[1]-1, img.shape[0]-1) is the bottom-right. – John Slaine
You set the ground truth to be nearly the whole image, so it's obvious that your model will learn to predict huge boxes. – keineahnung2345
The training images themselves are small (64x128); the images I test on are significantly larger. But it seems odd that there is no way to simply input the cropped images. – John Slaine

1 Answer


The model predicts exactly what it was trained for: huge bounding boxes covering the entire image. If every ground-truth box in your training data spans the whole image (normalized coordinates [0, 0, 1, 1]), the model will learn exactly that and reproduce it on the test set.

You may try a kind of augmentation: paste your images onto a larger black/grey canvas and adjust the bounding boxes accordingly. That is what the SSD augmentation pipeline does, for instance. However, there is no free and good way to compensate for the absence of a properly labelled training set.
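As a minimal sketch of that canvas idea (the function name, canvas size, and grey fill value are my own choices, not part of the API): paste each 64x128 crop at a random position on a larger grey canvas, and the pasted region's coordinates become a meaningful bounding box that no longer spans the whole image.

```python
import numpy as np

def paste_on_canvas(img, canvas_size=(512, 512), fill=127):
    """Place img at a random position on a grey canvas.

    Returns the canvas and the new (xmin, ymin, xmax, ymax)
    bounding box of the pasted image, in pixel coordinates.
    """
    ch, cw = canvas_size
    h, w = img.shape[:2]
    canvas = np.full((ch, cw, 3), fill, dtype=np.uint8)
    # Random top-left corner such that the image fits entirely
    x = np.random.randint(0, cw - w + 1)
    y = np.random.randint(0, ch - h + 1)
    canvas[y:y + h, x:x + w] = img
    return canvas, (x, y, x + w, y + h)
```

You would then write the returned box into the CSV instead of the near-full-image coordinates, so the model sees objects at varying positions and scales rather than boxes hugging the image borders.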