
I tried to improve the results of open-source OCR software. I'm using tesseract, because I find it still produces better results than gocr, but it has huge problems with bad-quality input. So I tried to preprocess the image with various tools I found on the internet:

  • unpaper
  • Fred's ImageMagick Scripts: TEXTCLEANER
  • manually using GIMP

But I was not able to get good results with this bad test document (really just for testing, I don't need the content of this file): http://9gag.com/gag/aBrG8w2/employee-handbook

This online service works surprisingly well with this test document: http://www.onlineocr.net/

I'm wondering whether it is possible to get similar results with tesseract using smart preprocessing. Are the open-source OCR engines really that bad compared to commercial ones? Even Google uses tesseract to scan documents, so I was expecting more...


1 Answer


Tesseract's recognition accuracy is a little lower than that of the best commercial engine (ABBYY FineReader), but it is more flexible by nature. This flexibility sometimes entails some preprocessing, because Tesseract cannot handle every situation on its own. It is actually used by Google because Google is its main sponsor!

The first thing you could do is enlarge the image so that characters are at least 20 pixels wide or more. Since Tesseract uses the main segments of the characters' outlines as features, it needs a larger character size than other algorithms do.
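As a sketch of that first step, here is one way to upscale a scan with Pillow before handing it to Tesseract. The library choice, the factor of 3, and the filenames are my own assumptions for illustration; the answer only says to enlarge the text:

```python
# Sketch: upscale a scan so characters reach roughly 20+ px before OCR.
# Assumes Pillow is installed; factor=3 is an illustrative default.
from PIL import Image

def upscale_for_ocr(img, factor=3):
    """Resize an image by an integer factor using Lanczos resampling,
    which preserves stroke edges better than nearest-neighbour."""
    w, h = img.size
    return img.resize((w * factor, h * factor), Image.LANCZOS)

# Hypothetical usage:
#   big = upscale_for_ocr(Image.open("scan.png"))
#   big.save("scan_big.png")   # then run: tesseract scan_big.png out
```

Upscaling cannot add real detail, but it gives Tesseract's feature extraction enough pixels per stroke to work with.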

Another thing you could try, again with the test document you mentioned, is to binarize the image with an adaptive thresholding method (here you can find some information about that: https://dsp.stackexchange.com/a/2504), because the illumination varies across the image. Tesseract binarizes the image internally, but this could be a case where that step fails (it's similar to the example here: Improving the quality of the output with Tesseract, where you can also find some other useful information).
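To make the adaptive-thresholding idea concrete, here is a minimal mean-minus-C variant in plain NumPy. The block size and offset are illustrative assumptions; in practice you would more likely reach for a library routine such as OpenCV's `cv2.adaptiveThreshold`, which implements the same idea:

```python
# Sketch of mean-C adaptive thresholding: each pixel is compared against
# the mean of its local block x block window minus a small offset, so a
# gradual illumination change does not wipe out whole regions of text the
# way a single global threshold would.
import numpy as np

def adaptive_threshold(gray, block=31, offset=15):
    """Return a 0/255 uint8 image; `block` should be odd."""
    g = gray.astype(np.float64)
    pad = block // 2
    padded = np.pad(g, pad, mode="edge")
    # Integral image gives each window's sum in O(1).
    ii = np.pad(padded.cumsum(axis=0).cumsum(axis=1), ((1, 0), (1, 0)))
    h, w = g.shape
    sums = (ii[block:block + h, block:block + w]
            - ii[:h, block:block + w]
            - ii[block:block + h, :w]
            + ii[:h, :w])
    local_mean = sums / (block * block)
    # Pixels brighter than (local mean - offset) become background (255).
    return np.where(g > local_mean - offset, 255, 0).astype(np.uint8)
```

Feeding the binarized result to Tesseract skips its internal (global Otsu-style) binarization, which is exactly the step that tends to fail on unevenly lit photos like the test document.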