
My goal is to do multi-class image classification with probability estimates.

So far, 'one-label' classification works nicely out of the box with all the great functionality the OpenCV C++ libraries provide. Currently I am using a BoW descriptor built from local SIFT descriptors, plus SVM classification. So far so good. But now I need probability estimates for the images: instead of "image A is class X", I need output like "image A is class X with 50% likelihood, class Y with 10%, class Z with 30%", and so on, with estimates for all classes.

Unfortunately I am not that competent in machine learning. I started investigating the problem and now my brain hurts. My noob questions for you:

Any tips are appreciated. Thanks!

P.S.: I know there are a bunch of similar questions answered here before but to me none of them really captured my point.


1 Answer


Some implementations of the SVM algorithm do provide probability estimates, but the SVM does not inherently provide them: the estimates are a function "tacked on" after the algorithm was created. These probability estimates are not trustworthy, and if I remember correctly, the ability to compute them was removed from the Scikit-Learn library a few releases ago for this reason.

Still, if you insist on using an SVM, look at A Practical Guide to Support Vector Classification from LibSVM, which is the library OpenCV calls under the hood. You can skip the math and go straight to the tips; the outputs of LibSVM, and hence of OpenCV's SVM, are explained in the document. Alternatively, you can use LibSVM directly. That gets you probability estimates without recompiling OpenCV (as suggested in your link), but the downside is that you will have to convert your data into the format LibSVM expects (i.e., OpenCV's Mat is unlikely to work with LibSVM directly).
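To give a feel for what these "tacked on" estimates are, LibSVM fits a sigmoid to the raw decision values (Platt scaling) and then couples the pairwise results. Below is a minimal pure-Python sketch of the per-binary-problem idea; the function names are mine, and LibSVM itself uses a more careful Newton-type fit with regularized targets, so treat this as a conceptual illustration only:

```python
import math

def platt_prob(f, A, B):
    """LibSVM-style sigmoid: P(y=1 | decision value f)."""
    return 1.0 / (1.0 + math.exp(A * f + B))

def fit_platt(scores, labels, lr=0.1, iters=2000):
    """Fit sigmoid parameters (A, B) by plain gradient descent on the
    negative log-likelihood (simplified stand-in for LibSVM's method)."""
    A, B = 0.0, 0.0
    n = len(scores)
    for _ in range(iters):
        gA = gB = 0.0
        for f, y in zip(scores, labels):
            err = y - platt_prob(f, A, B)  # y - P(y=1|f)
            gA += err * f
            gB += err
        A -= lr * gA / n
        B -= lr * gB / n
    return A, B

# toy decision values: positive -> class 1, negative -> class 0
scores = [2.0, 1.5, 1.0, -1.0, -1.5, -2.0]
labels = [1, 1, 1, 0, 0, 0]
A, B = fit_platt(scores, labels)
```

After the fit, `platt_prob(f, A, B)` maps any decision value to a number in (0, 1) that increases with the score, which is exactly why it can look like a probability without actually being a calibrated one.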

If you are using a linear SVM, i.e., an SVM with the linear kernel, then you can try replacing it with a Logistic Regression classifier: empirically they behave similarly, since both are linear classifiers, the difference being that one uses the hinge loss and the other the logistic loss. The probability estimates from Logistic Regression, on the other hand, do work.
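Logistic Regression produces probabilities directly because the model itself is a sigmoid over a linear score. A minimal pure-Python sketch on toy 1-D data (the function names are mine; a real library would handle multiple features, regularization, and multi-class):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(xs, ys, lr=0.5, iters=3000):
    """1-D logistic regression fit by gradient descent on the logistic loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iters):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # prediction minus label
            gw += err * x
            gb += err
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

# toy separable data: class 1 for larger x
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = train_logreg(xs, ys)
prob = sigmoid(w * 2.0 + b)  # P(class 1 | x = 2.0)
```

Here the probability is part of the model's definition rather than a post-hoc conversion, which is the key difference from the SVM case.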

Alternatively, consider using a Random Forest (or its variant, Extremely Randomized Trees) classifier. These also provide probability estimates, computed from the proportions of training samples in the leaf node reached by your test sample. That said, these two classifiers are not grounded in principled mathematics (researchers are still working out why they work in theory), but they have been known to work superbly in many real-world settings (Kinect pose estimation is one example).
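The leaf-proportion idea is simple enough to sketch. Below, each "tree" is reduced to a hand-built decision stump storing the training class counts of its two leaves (a real forest learns these structures from bootstrapped data); the probability estimate is the average over trees of the class proportions in the reached leaf. All names here are illustrative:

```python
# Each "tree" is a stump: a threshold plus the training class counts
# of the samples that fell into each leaf during training.
forest = [
    {"threshold": 0.0,  "left": {"A": 8, "B": 2}, "right": {"A": 1, "B": 9}},
    {"threshold": 0.5,  "left": {"A": 7, "B": 3}, "right": {"A": 2, "B": 8}},
    {"threshold": -0.5, "left": {"A": 9, "B": 1}, "right": {"A": 3, "B": 7}},
]

def predict_proba(forest, x):
    """Average, over trees, of the class proportions in the reached leaf."""
    totals = {}
    for tree in forest:
        leaf = tree["left"] if x <= tree["threshold"] else tree["right"]
        n = sum(leaf.values())
        for cls, count in leaf.items():
            totals[cls] = totals.get(cls, 0.0) + count / n
    return {cls: t / len(forest) for cls, t in totals.items()}

probs = predict_proba(forest, 1.0)  # roughly {"A": 0.2, "B": 0.8}
```

For x = 1.0 every stump routes to its right leaf, so the estimate is the mean of the right-leaf proportions across the three trees.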

The thing is, coming up with probability estimates is very hard if your classifier was not designed to do that from the beginning, i.e., it is not one of those you find in a standard statistical machine learning textbook. It is like pulling numbers out of one's ass. Most algorithms that perform classification simply compute a "score" for each category/label for each test sample and go with the one with the best score; that is much easier to do. The SVM tries to "translate" this score into a "probability", but the result is not calibrated, which effectively makes it useless.
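One way to see the problem: you can always squash a vector of per-class scores through a softmax so that the numbers sum to 1 and look like probabilities, but nothing forces them to match real-world frequencies. A hedged illustration (this naive softmax is not what LibSVM actually does, which is pairwise sigmoid coupling, but it shows the same pitfall):

```python
import math

def softmax(scores):
    """Turn arbitrary per-class scores into a vector summing to 1.

    The result *looks* like a probability distribution, but unless the
    underlying scores are calibrated it need not reflect how often each
    class is actually correct."""
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {c: math.exp(s - m) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

# Doubling every score sharpens the "probabilities" without any new
# evidence about the image -- a hint that these numbers are arbitrary.
p1 = softmax({"X": 1.0, "Y": 0.5, "Z": 0.0})
p2 = softmax({"X": 2.0, "Y": 1.0, "Z": 0.0})
```

The argmax (the predicted class) is the same in both cases; only the fake "confidence" changes, which is exactly why uncalibrated scores-turned-probabilities should not be trusted.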

You can take a look at the paper Predicting Good Probabilities With Supervised Learning for more details on how the probabilities are computed for some of these classifiers and why they need to be calibrated.
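"Calibrated" has a concrete meaning: among all samples the classifier assigns, say, 70% probability, about 70% should actually belong to that class. You can check this yourself by binning predictions and comparing each bin's mean predicted probability with its observed positive rate (the idea behind the reliability diagrams in that paper). A stdlib-only sketch, with names of my own choosing:

```python
def reliability_bins(probs, outcomes, n_bins=5):
    """Bin binary predictions and compare mean predicted probability
    with the observed positive rate per bin; large gaps in any bin
    indicate poor calibration."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
        bins[idx].append((p, y))
    report = []
    for b in bins:
        if not b:
            continue
        mean_p = sum(p for p, _ in b) / len(b)
        frac_pos = sum(y for _, y in b) / len(b)
        report.append((mean_p, frac_pos))
    return report

# toy, perfectly calibrated predictions: 1 of 10 "0.1"s is positive,
# 9 of 10 "0.9"s are positive
probs = [0.1] * 10 + [0.9] * 10
outcomes = [1] + [0] * 9 + [1] * 9 + [0]
report = reliability_bins(probs, outcomes)
```

For a well-calibrated classifier the two numbers in each tuple nearly coincide; for a raw SVM score pushed through a sigmoid they often do not.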

In general, I would advise taking probability estimates returned by a classifier with a grain of salt. If you want them, go with a statistical classifier such as Logistic Regression, not an SVM.

As for libraries: while OpenCV does provide some machine learning algorithms, they are very limited. Try a proper ML library instead. Since I assume you are using C++, I recommend taking a look at the free Shogun Machine Learning Library.

If you are using Python, or just wish to take a look at tutorials on how to use machine learning algorithms, then check out the excellent Scikit-Learn library.

Some general advice on applying machine learning algorithms to industry problems (slides): Experiences and Lessons in Developing Industry-Strength Machine Learning and Data Mining Software.