
I am trying to build a program that can correctly and confidently identify an object with Google Cloud Vision (denoted as GCV henceforth). The results returned are correct most of the time, with a confidence score for each label, like so:

{
    "banana": "0.92345",
    "yellow": "0.91002",
    "minion": "0.89921"
}

The environment I am working in has a diverse set of lighting conditions, and objects are expected to be placed in random positions. When an object is placed in a different position, the results returned from GCV will be slightly different, because a different image is queried. For example:

{
    "banana": "0.82345",
    "lemon": "0.82211",
    "yellow": "0.81102",
    "minion": "0.79921"
}

My program is designed so that when the object banana is detected with a confidence greater than a certain threshold, the next action is dispatched.

There are 3 clusters of object types: for instance, banana goes to container A, apple to container B, and orange to container C.
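
A minimal sketch of this dispatch logic in Python (the container mapping and the 0.8 threshold are illustrative placeholders, not values from my real system):

# Sketch of the current threshold-based dispatch; values are illustrative only.
CONTAINERS = {"banana": "A", "apple": "B", "orange": "C"}
THRESHOLD = 0.8  # this is the value my professor asked me to justify

def dispatch(gcv_labels):
    """gcv_labels: dict mapping label -> score string, as in the examples above."""
    for label, container in CONTAINERS.items():
        if float(gcv_labels.get(label, "0")) > THRESHOLD:
            return container  # trigger the next action for this container
    return None  # no label passed the threshold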

When I presented my work to my professor, he questioned how I can confidently define and validate the threshold value for each item with respect to its cluster.

I tried to obtain a mean score for banana by running hundreds of banana images through GCV, but eventually I found that this is probably not the correct way to define a threshold. My professor suggested using K-Nearest Neighbours to find the similarity of those images, but isn't that already a part of GCV? Even if what he suggested is correct, what is the correct approach to train a post-GCV classifier with the limited data returned from GCV?


1 Answer


In my own practice I used the NN algorithm, plus a few more techniques whose names I didn't know until I had to make slides about my work. The task was predicting whether a Facebook advertisement post would be banned by FB or not.

The NN needs some function of two elements to measure the distance between them, and to find the best function I faced the same problem as you: measuring how accurate the chosen approach is. If you know the real classes of the whole set, then after classifying it with your model you get four kinds of outcome, whose counts are called a confusion matrix.
https://en.wikipedia.org/wiki/Confusion_matrix
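
Here is a minimal sketch with scikit-learn, assuming binary classes (e.g. banana vs. not-banana); the data is made up:

from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # the real classes
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]  # the classes your model predicted

# For binary labels the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 3 1 1 3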

You can apply different metrics to the confusion matrix to measure the accuracy you've got; the metric I chose turned out to be known as the Fowlkes-Mallows index.
https://en.wikipedia.org/wiki/Fowlkes%E2%80%93Mallows_index
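
For two classes the index can be computed directly from those four numbers as TP / sqrt((TP + FP) * (TP + FN)), i.e. the geometric mean of precision and recall; continuing the sketch above:

import math

def fowlkes_mallows(tp, fp, fn):
    # geometric mean of precision TP/(TP+FP) and recall TP/(TP+FN)
    return tp / math.sqrt((tp + fp) * (tp + fn))

print(fowlkes_mallows(tp=3, fp=1, fn=1))  # 0.75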

I could have trained the system on one set of images, told it to classify another one, and then calculated the FM index, but how exactly do I split the whole set in two? So I didn't. Instead, for a set of N images I took one image out of the set and retrained the model N times, each time on the remaining N-1 images. That technique is called cross-validation (this particular form is leave-one-out).
https://en.wikipedia.org/wiki/Cross-validation_%28statistics%29#k-fold_cross-validation
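
A sketch of that leave-one-out procedure with scikit-learn; KNeighborsClassifier stands in for whatever NN model and distance function you end up with, and X, y are placeholder features and labels:

import math
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(100, 5)        # placeholder feature vectors
y = np.random.randint(0, 2, 100)  # placeholder binary labels

preds = np.empty_like(y)
for train_idx, test_idx in LeaveOneOut().split(X):
    model = KNeighborsClassifier(n_neighbors=3)
    model.fit(X[train_idx], y[train_idx])         # retrain on the remaining N-1 images
    preds[test_idx] = model.predict(X[test_idx])  # classify the held-out image

tn, fp, fn, tp = confusion_matrix(y, preds).ravel()
print(tp / math.sqrt((tp + fp) * (tp + fn)))      # the FM index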

This is enough to answer your current question. In my case I had 2000 images of two classes and a resulting FM index of 0.6. Here are my slides (in Russian); all these links are at the end: https://drive.google.com/file/d/0B3BLwu7Vb2U-SVhKYWVMR2JvOFk/view?usp=sharing


What you'll find later is that the accuracy can be increased a lot if you optimize the training set by throwing out images that teach the model wrong assumptions, or that are simply not useful because they are already densely surrounded by cases of the same class in the space the NN uses. So I was throwing out different subsets and recalculating the FM index.
But since cross-validation requires retraining, and a set of 2000 images can be shrunk in 2^2000 ways, this was a very slow procedure, so I could not fully solve the optimization. You may try a depth-first traversal of the tree of subsets to throw out, with some heuristics, but I used a custom approach and was able to increase the FM index from 0.6 to 0.8 in two hours; classification was then correct in 98% of cases.
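
My custom approach is out of scope here, but as a hypothetical illustration of the idea, a simple greedy variant would be: tentatively drop one image, re-run the leave-one-out evaluation, and keep the removal only when the FM index improves:

import numpy as np

# Hypothetical greedy pruning sketch, not the custom approach described above.
def prune(X, y, loo_fm):
    """loo_fm(X, y) -> FM index, e.g. the leave-one-out evaluation sketched earlier."""
    best = loo_fm(X, y)
    i = 0
    while i < len(X):
        X_try = np.delete(X, i, axis=0)
        y_try = np.delete(y, i)
        score = loo_fm(X_try, y_try)
        if score > best:   # removing this image helped: drop it for good
            best, X, y = score, X_try, y_try
        else:              # removing it hurt: keep the image, try the next one
            i += 1
    return X, y, best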