
I’m using the Visual Recognition service on IBM Bluemix.

I have created several classifiers; two of them in particular have the following objectives:

  • First: a “generic” classifier that returns a confidence score for the recognition of a particular object in the image. I trained it with 50 positive examples of the object and 50 negative examples of things similar to the object (details of it, its components, images that resemble it, etc.). A sketch of the corresponding API calls follows this list.
  • Second: a more specific classifier that recognizes the particular type of the object identified by the first one, and is used only if the first classifier’s score is high enough. It was trained like the first one: 50 positive examples of type A objects and 50 negative examples of type B objects. This second categorization should be more specific than the first, because the images are more detailed and are all similar to one another.
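
For reference, this is roughly how the two classifiers were created. It is only a minimal sketch, assuming the watson-developer-cloud Python SDK (VisualRecognitionV3); the classifier names, class names, zip file names, and the exact keyword arguments are illustrative and may differ between SDK versions.

```python
# Minimal sketch, assuming the watson-developer-cloud Python SDK.
# Classifier names, class names and file names are illustrative only.
from watson_developer_cloud import VisualRecognitionV3

visual_recognition = VisualRecognitionV3('2016-05-20', api_key='YOUR_API_KEY')

# First ("generic") classifier: one positive class plus negative examples.
with open('object_positives.zip', 'rb') as pos, \
     open('object_negatives.zip', 'rb') as neg:
    generic = visual_recognition.create_classifier(
        'generic_object',
        object_positive_examples=pos,   # 50 positive examples of the object
        negative_examples=neg)          # 50 similar-but-negative examples

# Second ("specific") classifier: type A as positives, type B as negatives.
with open('type_a_positives.zip', 'rb') as pos, \
     open('type_b_negatives.zip', 'rb') as neg:
    specific = visual_recognition.create_classifier(
        'specific_type',
        type_a_positive_examples=pos,   # 50 positive examples of type A
        negative_examples=neg)          # 50 negative examples of type B

print(generic['classifier_id'], specific['classifier_id'])
```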

The two classifiers work well: on a particular test set of images the predictions match the ground truth in most cases, which suggests that both have been trained properly.

But there is one thing I don’t understand.

For both classifiers, if I classify one of the images that was used in the positive training set, I expect the confidence score to be close to 90-100%. Instead, I always get a score between 0.50 and 0.55. The same thing happens when I try an image very similar to one from the positive training set (scaled, reflected, cropped, etc.): the confidence never goes above roughly 0.55.
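
To be concrete, this is roughly what the test looks like, continuing from the sketch above (same assumed Python SDK; the classifier ID and the image file name are placeholders):

```python
# Sketch: re-classify one of the positive training images and print the score.
# The classifier ID and the image file name are placeholders.
with open('one_of_the_positive_training_images.jpg', 'rb') as image_file:
    result = visual_recognition.classify(
        images_file=image_file,
        classifier_ids=['generic_object_1234567890'])

# Walk the response down to the per-class scores.
for image in result['images']:
    for classifier in image.get('classifiers', []):
        for cls in classifier.get('classes', []):
            print(cls['class'], cls['score'])   # always around 0.50-0.55
```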

I’ve tried creating a similar classifier with 100 positive images and 100 negative images, but the result never changes.

The question is: why is the confidence score so low? Why isn’t it close to 90-100% for images that were used in the positive training set?


1 Answer


The scores from Visual Recognition custom classifiers range from 0.0 to 1.0, but they are unitless and are not percentages or probabilities. (The scores for the different classes do not add up to 100% or 1.0.)

When the service creates a classifier from your examples, it is trying to figure out what distinguishes the features of one class of positive_examples from the other classes of positive_examples (and negative_examples, if given). The scores are based on the distance to a decision boundary between the positive examples for the class and everything else in the classifier. It attempts to calibrate the score output for each class so that 0.5 is a decent decision threshold, to say whether something belongs to the class.

However, given the cost-benefit balance of false alarms vs. missed detections in your application, you may want to use a higher or lower threshold for deciding whether an image belongs to a class.
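
For example, here is a minimal sketch of applying your own cut-off on the client side, assuming the same watson-developer-cloud Python SDK as in the question; the classifier ID, file name and the 0.6 cut-off are illustrative. Passing a low threshold to the service keeps it from filtering out classes that score below its default of 0.5, so you can apply your own decision rule afterwards.

```python
# Sketch: request all scores from the service, then apply a custom threshold.
# The classifier ID, file name and the 0.6 cut-off are illustrative.
from watson_developer_cloud import VisualRecognitionV3

visual_recognition = VisualRecognitionV3('2016-05-20', api_key='YOUR_API_KEY')
MY_THRESHOLD = 0.6

with open('candidate_image.jpg', 'rb') as image_file:
    result = visual_recognition.classify(
        images_file=image_file,
        classifier_ids=['generic_object_1234567890'],
        threshold=0.0)  # ask the service not to filter at its default 0.5

for image in result['images']:
    for classifier in image.get('classifiers', []):
        for cls in classifier.get('classes', []):
            decision = 'accept' if cls['score'] >= MY_THRESHOLD else 'reject'
            print(cls['class'], cls['score'], decision)
```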

Without knowing the specifics of your class examples, my guess is that there is a significant amount of similarity between your classes, that in the feature space your examples do not form distinct clusters, and that the scores reflect this closeness to the decision boundary.