How to prevent Tesseract from recognizing small lines as numbers or letters?

Question

I'm using Tesseract to recognize big and clear text in 1bpp images. It works beautifully for the font and font size I selected. However, it also recognizes some small lines and speckles as letters or numbers. In the attached image, Tesseract recognizes not only "Ge", "1", "2", "J.", and "Sp", but also an additional "1" on each line, corresponding to the small vertical lines you can see there. How can I prevent Tesseract from doing this?

Thanks in advance.

Sample image

cortex42 · Accepted Answer · 2014-12-02T14:52:49

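You should preprocess your image before passing it to Tesseract. OpenCV offers morphological operations such as erosion and dilation that can remove these speckles and lines (http://docs.opencv.org/doc/tutorials/imgproc/erosion_dilatation/erosion_dilatation.html). An erosion followed by a dilation (a morphological opening) removes marks that are thinner than the structuring element while leaving thicker strokes largely intact, so pick an element that is wider than the stray lines but narrower than the strokes of your text. Note that OpenCV applies these operations to the bright pixels, so for dark text on a white background you need to invert the image first (or swap the two operations).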
