1
votes

I'm using Tesseract to recognize big and clear text in 1bpp images. It works beautifully for the font and font size I selected. However, it also recognizes some small lines and speckles as letters or numbers. In the attached image, Tesseract recognizes not only "Ge", "1", "2", "J.", and "Sp", but also an additional "1" on each line, corresponding to the small vertical lines you can see there. How can I prevent Tesseract from doing this?

Thanks in advance.

Sample image

2 Answers

1
votes

You should preprocess your image first. OpenCV offers morphological operations such as erosion and dilation, which can remove these speckles and lines (http://docs.opencv.org/doc/tutorials/imgproc/erosion_dilatation/erosion_dilatation.html).
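
To show what erosion followed by dilation (a morphological opening) does to thin lines, here is a minimal, dependency-free Python sketch. The helper names `erode`, `dilate`, and `opening` are my own, and the sketch assumes text pixels are 1, background is 0, and the structuring element is a 3x3 square (`k=1`). In OpenCV the same steps are available as `cv2.erode` and `cv2.dilate`, or combined via `cv2.morphologyEx` with `cv2.MORPH_OPEN`.

```python
def erode(img, k=1):
    """Keep a pixel only if its whole (2k+1)x(2k+1) neighbourhood is set.
    Pixels outside the image count as background."""
    h, w = len(img), len(img[0])
    return [[int(all(0 <= y + dy < h and 0 <= x + dx < w and img[y + dy][x + dx]
                     for dy in range(-k, k + 1)
                     for dx in range(-k, k + 1)))
             for x in range(w)]
            for y in range(h)]


def dilate(img, k=1):
    """Set a pixel if any pixel in its (2k+1)x(2k+1) neighbourhood is set."""
    h, w = len(img), len(img[0])
    return [[int(any(0 <= y + dy < h and 0 <= x + dx < w and img[y + dy][x + dx]
                     for dy in range(-k, k + 1)
                     for dx in range(-k, k + 1)))
             for x in range(w)]
            for y in range(h)]


def opening(img, k=1):
    """Erode, then dilate: removes blobs thinner than the structuring element."""
    return dilate(erode(img, k), k)


# Toy 7x10 image (assumption: 1 = ink): a 4x4 block standing in for a thick
# character stroke, plus a 1-pixel-wide vertical line like the ones in the question.
noisy = [[0] * 10 for _ in range(7)]
for y in range(1, 5):
    for x in range(1, 5):
        noisy[y][x] = 1
for y in range(1, 6):
    noisy[y][7] = 1

cleaned = opening(noisy)
for row in cleaned:
    print("".join("#" if p else "." for p in row))
```

The thin line disappears because no pixel on it has a fully set 3x3 neighbourhood, while the thick block is eroded and then grown back. Strokes thinner than the structuring element are lost too, so the kernel size has to be chosen relative to the stroke width of the font.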

1
votes

As the other answer suggested, simple erosion will help remove the lines. However, if the lines always lie outside the area containing the real characters, you can use a simple trick to avoid degrading the real characters while eroding: use a strongly eroded image only to find the bounding box of the real characters, then use that bounding box to crop the region of interest out of the original image.
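
A rough, dependency-free Python sketch of that trick follows. The names `erode`, `bounding_box`, and `crop_to_text` are hypothetical, and the sketch assumes text pixels are 1, background is 0, and a square structuring element of radius `k`. With OpenCV, the same idea could be built from `cv2.erode` plus a bounding box of the non-zero pixels.

```python
def erode(img, k=1):
    """Keep a pixel only if its whole (2k+1)x(2k+1) neighbourhood is set."""
    h, w = len(img), len(img[0])
    return [[int(all(0 <= y + dy < h and 0 <= x + dx < w and img[y + dy][x + dx]
                     for dy in range(-k, k + 1)
                     for dx in range(-k, k + 1)))
             for x in range(w)]
            for y in range(h)]


def bounding_box(img):
    """Return (x0, y0, x1, y1), inclusive, of all set pixels, or None if empty."""
    ys = [y for y, row in enumerate(img) if any(row)]
    if not ys:
        return None
    xs = [x for x in range(len(img[0])) if any(row[x] for row in img)]
    return min(xs), min(ys), max(xs), max(ys)


def crop_to_text(img, k=1):
    """Locate the text on an eroded copy, then crop the untouched original."""
    box = bounding_box(erode(img, k))
    if box is None:          # erosion removed everything: leave the image alone
        return img
    h, w = len(img), len(img[0])
    x0, y0, x1, y1 = box
    # Erosion shrinks each blob by k pixels per side, so grow the box back by k.
    x0, y0 = max(x0 - k, 0), max(y0 - k, 0)
    x1, y1 = min(x1 + k, w - 1), min(y1 + k, h - 1)
    return [row[x0:x1 + 1] for row in img[y0:y1 + 1]]


# Toy 7x13 page (assumption: 1 = ink): two thick 3x3 blocks joined by a thin
# 1-pixel stroke (part of the "character"), plus a stray vertical line at x=11.
page = [[0] * 13 for _ in range(7)]
for y in range(2, 5):
    for x in list(range(1, 4)) + list(range(6, 9)):
        page[y][x] = 1
page[3][4] = page[3][5] = 1
for y in range(1, 6):
    page[y][11] = 1

cropped = crop_to_text(page, k=1)
for row in cropped:
    print("".join("#" if p else "." for p in row))
```

The thin connecting stroke survives in the cropped result because the erosion is used only to locate the text, not to produce the image that is passed to Tesseract. This only works when the noise lies outside the text's bounding box, as stated above.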