5 votes

I'm using the Tesseract OCR engine in an iPhone application to read specific numeric fields from bill invoice photos. With a lot of photo pre-processing (adaptive thresholding, artifact cleaning, etc.) the results are finally fairly accurate, but there are still some cases I want to improve.

If the user takes a photo in low-light conditions and there is some noise or artifacts in the picture, the OCR engine interprets these artifacts as additional digits. In some rare cases it can read e.g. a numeric amount of "32,15" EUR as "5432,15" EUR, and this is not at all good for the end user's confidence in the product.

I assume that, if there is an internal OCR-engine read error associated with each character read, it will be higher on the "54" digits of my previous example, as they are recognized from small noise pixels, and if I had access to these read-error values I would be able to easily discard the erroneous digits.

Do you know of any method to get a read-error magnitude (or any "accuracy factor" value) for each individual character returned by the Tesseract OCR engine?


1 Answer

4 votes

It is called a "confidence" value in Tesseract terminology. Searching for that term in the tesseract-ocr group turns up many answers that mention a TesseractExtractResult method.
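If you are calling the C++ API directly, a more current route (a minimal sketch, assuming Tesseract 3.x or later with Leptonica for image loading; "invoice.jpg" is just a placeholder file name) is to walk a ResultIterator at RIL_SYMBOL level, which exposes a 0-100 confidence for each recognized character:

    #include <leptonica/allheaders.h>
    #include <tesseract/baseapi.h>
    #include <tesseract/resultiterator.h>
    #include <cstdio>

    int main() {
        tesseract::TessBaseAPI api;
        if (api.Init(NULL, "eng") != 0) {      // default tessdata path, English
            fprintf(stderr, "Could not initialize Tesseract.\n");
            return 1;
        }

        Pix *image = pixRead("invoice.jpg");   // placeholder input image
        if (!image) {
            fprintf(stderr, "Could not read image.\n");
            return 1;
        }
        api.SetImage(image);
        api.Recognize(NULL);

        // Walk the result one symbol (character) at a time and print each
        // recognized character together with its confidence (0-100).
        tesseract::ResultIterator *it = api.GetIterator();
        if (it != NULL) {
            do {
                char *symbol = it->GetUTF8Text(tesseract::RIL_SYMBOL);
                float conf   = it->Confidence(tesseract::RIL_SYMBOL);
                if (symbol != NULL) {
                    printf("%s\t%.2f\n", symbol, conf);
                    delete [] symbol;
                }
            } while (it->Next(tesseract::RIL_SYMBOL));
            delete it;
        }

        pixDestroy(&image);
        api.End();
        return 0;
    }

Characters recognized from noise pixels, like the spurious "54" in your example, should come back with noticeably lower confidence than the real digits, so you can simply drop symbols below a threshold tuned on your own invoices.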

The hOCR output also contains this value.
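For example, running tesseract with the hocr config (assuming Tesseract 3.x or newer) produces an HTML file in which each word carries an x_wconf confidence in its title attribute:

    tesseract invoice.jpg result hocr

    <!-- excerpt of result.hocr (illustrative values) -->
    <span class='ocrx_word' id='word_1_12' title='bbox 210 348 312 380; x_wconf 89'>32,15</span>

Note that hOCR reports confidence per word (x_wconf) rather than per character, so for character-level filtering the ResultIterator approach above is more direct.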