Is it normal that tesseract does not recognize this word in this image?

Question

I need to extract words from small images like this:

I am using tesseract from the command line with the Spanish language option, like this:

tesseract category.png -l spa -psm 7 category.txt

I think this text should be easy for the OCR to parse, but the word is not recognized. I am using -l spa for the Spanish language and -psm 7 because the image contains only one line of text (in any case, the result is the same if I omit the -psm parameter).

This is the result: s…"…

I am using this build with the language package: http://domasofan.spdns.eu/tesseract/ (cited as an official source on GitHub)

Dainius Šaltenis · Accepted Answer · 2016-04-17T14:36:24

Tesseract seems to really struggle with low-resolution characters.

Try scanning this image instead. I upscaled it by 400 percent (200 percent may be enough for scanning, but let's try 400%), applied a heavy blur, and thresholded it at a value of about 140. The results should be much better, and I hope they satisfy you. If you need to do this programmatically, say in the comments what is unclear and I will provide some additional information.
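For anyone who wants to try the same three steps in code, here is a minimal sketch in plain Python. It is only an illustration of the idea, not the exact procedure I used in the image editor: the grayscale image is assumed to be a list of rows of 0-255 values, and the nearest-neighbour upscale, the box blur, its radius and the function names are all my own assumptions. Only the 400% factor and the cutoff of about 140 come from the description above.

```python
def upscale(pixels, factor=4):
    """Nearest-neighbour upscale: repeat each pixel `factor` times in both directions."""
    out = []
    for row in pixels:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def box_blur(pixels, radius=2):
    """Replace each pixel by the integer mean of its neighbourhood (window clipped at the edges)."""
    h, w = len(pixels), len(pixels[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            window = [pixels[j][i] for j in ys for i in xs]
            row.append(sum(window) // len(window))
        out.append(row)
    return out


def threshold(pixels, cutoff=140):
    """Binarize: values above `cutoff` become white (255), everything else black (0)."""
    return [[255 if v > cutoff else 0 for v in row] for row in pixels]


def preprocess(pixels, factor=4, radius=2, cutoff=140):
    """Upscale, blur, then threshold, in that order."""
    return threshold(box_blur(upscale(pixels, factor), radius), cutoff)


if __name__ == "__main__":
    # Tiny hypothetical 2x2 "image": one dark pixel on a light background.
    tiny = [[255, 0],
            [255, 255]]
    for row in preprocess(tiny):
        print("".join("#" if v == 0 else "." for v in row))
```

In practice you would load the real image with an imaging library and use its resize, blur and threshold functions rather than nested lists, then save the result and pass it to tesseract with the same command as in the question.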
