
I need to extract words from small images like this:

enter image description here

I am using Tesseract from the command line with the Spanish language option, like this:

tesseract category.png -l spa -psm 7 category.txt

I expected this text to be easy for the OCR to read, but the word is not recognized. I am using -l spa for the Spanish language and -psm 7 because the image contains only a single line of text (the result is the same if I omit the -psm parameter).

This is the result: s…"…

I am using this build with the language package: http://domasofan.spdns.eu/tesseract/ (the official source cited on GitHub)


1 Answer


Tesseract seems to really struggle when scanning low-resolution characters.

enter image description here

Try scanning this image instead. I increased its resolution by 400 percent (200 percent may be enough for OCR, but let's try 400%), applied a heavy blur, and then thresholded it at a value of about 140. The results should be much better, and I hope that solves your problem. If you need to do this programmatically, say in the comments what is unclear and I will provide more details.
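For anyone who wants to do this programmatically, here is a minimal sketch of the three steps (upscale, blur, threshold) in plain Python. It is not the exact procedure I used in the image editor: it assumes the image is already available as a grid of grayscale values (0 to 255), reads "400 percent" as a scale factor of 4, and swaps in nearest-neighbour scaling and a simple box blur because they are easy to write without dependencies. The function names and the blur radius are my own choices for illustration. Loading and saving the PNG is left out and would need an imaging library.

```python
# Sketch only: a grayscale image is a list of rows, each a list of ints in 0..255.

def upscale(pixels, factor=4):
    """Nearest-neighbour upscale: repeat every pixel `factor` times in x and y."""
    out = []
    for row in pixels:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def box_blur(pixels, radius=2):
    """Replace each pixel with the mean of its neighbourhood (clipped at the edges)."""
    height, width = len(pixels), len(pixels[0])
    out = []
    for y in range(height):
        new_row = []
        for x in range(width):
            total = count = 0
            for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                    total += pixels[ny][nx]
                    count += 1
            new_row.append(total // count)
        out.append(new_row)
    return out


def threshold(pixels, cutoff=140):
    """Pixels at or above `cutoff` become white (255), the rest black (0)."""
    return [[255 if value >= cutoff else 0 for value in row] for row in pixels]


def preprocess(pixels, factor=4, radius=2, cutoff=140):
    """Upscale, blur, then threshold, in that order."""
    return threshold(box_blur(upscale(pixels, factor), radius), cutoff)
```

Blurring before thresholding is what smooths the blocky edges left by upscaling, so the order matters. The processed image can then be passed to the same tesseract command used in the question.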