Improve Tesseract OCR results with blurred text

Question

I am working on OCR of printed text. In particular, I am focusing on the preprocessing step to improve the results of the Tesseract engine. I have already obtained good results with adaptive thresholding, noise removal, text deskewing, etc. Still, Tesseract seems to fail where commercial products return decent results.

I used the following test image; here are the results obtained with Tesseract 3.04 compared to two commercial OCR APIs. All three services were given the same binary image, which contains some slightly blurred text.

Text image used to compare the 3 OCR products

Tesseract

Careers in Technology Consulting

Networking Lunch
21 m 2014, 11:00 - 14:30

Definingthecorporatellstmtegy, Wammmwdngdeal, creating
uniquebwinessisighnwilgbigdam-doesﬂismﬂxemmyouaﬁoy?

Findoutmoreabanhowitfeektomkasatedlﬂogymbyjoiningour

for further mm please visit mAeloittexom/weers

ABBYY Fine Reader Online

Careers in Technology Consulting
Networking Lunch
21 November 2014,1140-14:30
Defining the corporate IT strategy, planning a multHnKon <Mar outsourcing deal, creating unique business insights using big data-doesthis sound Ifce something you enjoy?
Find out more about hour it feels to work as a technology consultant by joining our exclusive networking lunch,
For further information please visit wrwMuleloittexom/carcert

Online OCR

Careers in Technology Consulting Networking Lunch 21 November 2014, 11;00 —14:30 
Defining the corporate IT strategy, planning a muiti-indlimi dollar outsourcing deal, creating unique business insights using big data—does this sound like something you enjoy? 
Find out more about how it feels to work as a tedmology consultant by joining our exclusive networking lunch, 
For further information' please visit wwwdeloitte,com/careers

Now I wonder whether the big gap between Tesseract and the other two products is due to a different engine (ABBYY certainly uses its own engine; I am not sure about OCR Web Service) or whether there are other preprocessing steps that can be done before running Tesseract. Do you have any suggestions?

Claudio Claudio · Accepted Answer · 2017-03-29T10:21:23

Here is a suggestion for "magic" OCR preprocessing. To explain the principle of the proposed preprocessing idea, let's consider an excerpt from the provided text image on which all of the tested OCR engines failed:

and apply some "preprocessing wisdom" to it. First, the usual thresholding:
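As a rough sketch of the thresholding step: the answer does not say which method was used, so this assumes a global Otsu threshold. It is written in pure Python on nested lists so it runs without dependencies (in practice you would probably reach for NumPy or OpenCV). The names `otsu_threshold` and `binarize` are made up for illustration.

```python
def otsu_threshold(gray):
    """Return an Otsu threshold for a grayscale image (rows of ints in 0..255)."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of the intensities of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        # Between-class variance (up to a constant factor).
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray, threshold=None):
    """Map dark pixels (<= threshold) to 1 (ink) and light pixels to 0 (paper)."""
    if threshold is None:
        threshold = otsu_threshold(gray)
    return [[1 if v <= threshold else 0 for v in row] for row in gray]


# Tiny example: dark text pixels on a light background.
gray = [
    [250, 240, 30],
    [245, 20, 25],
    [255, 250, 10],
]
binary = binarize(gray)
```

The later steps assume this convention: 1 means ink, 0 means background.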

Then comes some "magic": shoot vertical lines through the word elements, detect "bars" that are at most 2 pixels high, and cut them at their edges, along with cutting the word element down to its bottom line:
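One possible reading of the vertical scan, as a hedged sketch rather than the exact procedure used for the images: a column counts as part of a "bar" when every ink run in it is at most `max_height` pixels tall, and each group of adjacent bar columns is cut by blanking its first and last column over the full height. The image is assumed to be a nested list with 1 for ink and 0 for background; `vertical_runs`, `is_bar_column` and `cut_thin_bars` are illustrative names.

```python
def vertical_runs(column):
    """Yield (start, length) for every run of ink pixels (1) in a column."""
    start = None
    for y, v in enumerate(column):
        if v and start is None:
            start = y
        elif not v and start is not None:
            yield start, y - start
            start = None
    if start is not None:
        yield start, len(column) - start


def is_bar_column(column, max_height=2):
    """True if the column has ink and every ink run is at most max_height tall."""
    runs = list(vertical_runs(column))
    return bool(runs) and all(length <= max_height for _, length in runs)


def cut_thin_bars(img, max_height=2):
    """Blank the edge columns of each group of adjacent thin-bar columns."""
    height, width = len(img), len(img[0])
    columns = [[img[y][x] for y in range(height)] for x in range(width)]
    bar = [is_bar_column(col, max_height) for col in columns]

    out = [row[:] for row in img]  # work on a copy, leave the input untouched
    for x in range(width):
        if not bar[x]:
            continue
        left_edge = x == 0 or not bar[x - 1]
        right_edge = x == width - 1 or not bar[x + 1]
        if left_edge or right_edge:
            for y in range(height):
                out[y][x] = 0
    return out


# Two vertical strokes joined by a 1-pixel-high bridge (the kind blur creates).
img = [
    [1, 1, 0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0, 1, 1],
    [1, 1, 1, 1, 1, 1, 1],
    [1, 1, 0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0, 1, 1],
]
cut = cut_thin_bars(img)
```

Note that this rule would also cut genuinely thin horizontal strokes (a hyphen, the foot of an "L"), so `max_height` has to stay below the real stroke thickness of the font.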

Now switch from vertical to horizontal lines through the word elements in this image, in order to detect very wide "bars" and cut them vertically in the middle of their width:
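The horizontal scan could be sketched along these lines, again only as an assumption about what the step does: every horizontal ink run at least `min_width` pixels wide is split by blanking a short cut at its midpoint. The value of `min_width` is a hypothetical tuning parameter (the answer only says "very wide"), and the image is assumed to be a nested list with 1 for ink and 0 for background.

```python
def horizontal_runs(row):
    """Yield (start, length) for every run of ink pixels (1) in a row."""
    start = None
    for x, v in enumerate(row):
        if v and start is None:
            start = x
        elif not v and start is not None:
            yield start, x - start
            start = None
    if start is not None:
        yield start, len(row) - start


def split_wide_bars(img, min_width=12, cut_width=1):
    """Cut every horizontal ink run of at least min_width pixels at its middle."""
    out = [row[:] for row in img]  # work on a copy, leave the input untouched
    for y, row in enumerate(img):
        for start, length in horizontal_runs(row):
            if length < min_width:
                continue
            mid = start + length // 2
            for x in range(mid, min(mid + cut_width, start + length)):
                out[y][x] = 0
    return out


# A 10-pixel-wide run (two glyphs smeared together) above two normal strokes.
img = [
    [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0],
]
split = split_wide_bars(img, min_width=8)
```

Because consecutive rows of the same smeared stroke tend to have runs with similar extents, the per-row cuts line up into a roughly vertical gap.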

This should help any OCR engine produce better results on this particular image. I can imagine that some commercial OCR engines already use this approach, which would let them provide better recognition than the ones tested here.

In this context, let me mention some other free OCR engines available in the Ubuntu repositories (comparable to Tesseract). Testing them against each other, you may wonder even more why they produce different results; you can then look into their source code to find out :) and infer from this experience something about the commercial ones.

sudo apt-get install cuneiform gocr ocrad
