I am working on OCR recognition of printed text. In particular I am focusing on the preprocessing step to improve the results of the Tesseract engine. I have already obtained good results with adaptive thresholding, noise removal, text deskew, etc... But still Tesseract seems to fail when other commercial product return decent results.
I used the following test image and here are the results obtained with Tesseract 3.04 compared to two commercial OCR apis. All the 3 services were provided with the same binary image that contains some slightly blurred text.
Tesseract
Careers in Technology Consulting
Networking Lunch
21 m 2014, 11:00 - 14:30
Definingthecorporatellstmtegy, Wammmwdngdeal, creating
uniquebwinessisighnwilgbigdam-doesflismflxemmyouafioy?
Findoutmoreabanhowitfeektomkasatedlflogymbyjoiningour
for further mm please visit mAeloittexom/weers
ABBYY Fine Reader Online
Careers in Technology Consulting
Networking Lunch
21 November 2014,1140-14:30
Defining the corporate IT strategy, planning a multHnKon <Mar outsourcing deal, creating unique business insights using big data-doesthis sound Ifce something you enjoy?
Find out more about hour it feels to work as a technology consultant by joining our exclusive networking lunch,
For further information please visit wrwMuleloittexom/carcert
Careers in Technology Consulting Networking Lunch 21 November 2014, 11;00 —14:30
Defining the corporate IT strategy, planning a muiti-indlimi dollar outsourcing deal, creating unique business insights using big data—does this sound like something you enjoy?
Find out more about how it feels to work as a tedmology consultant by joining our exclusive networking lunch,
For further information' please visit wwwdeloitte,com/careers
Now I wonder whether the big gap between Tesseract and the other two products is due to a different engine (for sure ABBYY uses its own engine, not sure about OCR Web Service) or there are some other preprocessing steps that can be done before running Tesseract. Do you have any suggestions?