Digital Numbers on Tesseract OCR

Question

SOLUTION:

I had to train my own data for the OCR. It seems to work well, but I don't know why the trained data from arturaugusto does not work for me =(

https://github.com/adri1992/Tesseract_sevenSegmentsLetsGoDigital.git

With my trained data, I followed these steps to get good OCR results (all done with OpenCV):

First, convert the image to black and white.
Second, apply a Gaussian blur to the image.
Third, apply a threshold filter to the image.

With this, the seven-segment digits are recognized.

QUESTION:

I'm trying to run OCR with Tesseract on Android, and I'm testing the app with this image (via Text detection on Seven Segment Display via Tesseract OCR):

OCR test image

I'm using the data trained by arturaugusto (https://github.com/arturaugusto/display_ocr), but the OCR returns the wrong result:

884288

The zero is recognized as an eight, and I don't know why.

I'm applying a Gaussian blur and a threshold filter to the image via OpenCV, and the processed image is this:

OCR Image processed

Is there any other trained data, or do you know any way to solve the problem?

Hi Felipe! I've trained my own data... Try it at github.com/adri1992/Tesseract_sevenSegmentsLetsGoDigital and let me know if it works for you. Remember to do all the steps I describe in the "solution" section of the post - adlagar
I managed to process your test image using Python Pillow and got a B/W image similar to yours, but when I run tesseract with your trained data it returns an empty page (!). I'm not sure if I installed the trained data correctly... I copied everything to the folder /opt/local/share/tessdata (I'm on Mac OS X). When I run tesseract --list-langs the "lets" language is shown. Do you have any tips? By the way, has your training data stopped mistaking "0" for "8" (as you stated in your question)? - Felipe Ferri
Hi Zeeshan! I trained my own data. It should work with that specific font: github.com/adri1992/Tesseract_sevenSegmentsLetsGoDigital - adlagar

art art · Accepted Answer · 2015-06-02T19:29:06

Try using erode to fill the gaps between the segments. I think the problem is that Tesseract can't handle segmented fonts well.

With OpenCV-python, I use cv2.erode(display, kernel, iterations=erosion_iters) to solve this problem.
