Tesseract confuses two numbers

Question

I'm writing an application to scan numbers from an image.

The numbers are using the OCR-B font and may also contain + and > characters.

This is my source image:

source image

Tesseract's results weren't very good, even when I limited the character set to the characters mentioned above. As I couldn't find any OCR-B training files for Tesseract, I decided to train it myself.

I created this training image and made a box file from it. The box file is correct, all letters are matched correctly.

Then I did all steps described here to create the other necessary files.

Using this newly trained OCR-B tessdata set, I get pretty good results on the source image, with one little bug: all 1s are mistaken for 8s and vice versa. The command used to process the image was

$ tesseract esr2c.tif ocrb-esr2c -l ocrb

and the output for the source image was

0800000001456>8 00000195731208 8 01050008 023+ 08 0301226>20

If you swap all 1s and 8s and compare the result to the source image, the output is correct (except for the last two characters, which I can ignore).
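For illustration only, the swap described above can be sketched in Python. The function name `swap_1_8` is hypothetical, and this merely reproduces the symptom on the output string quoted above; it is not a fix for the training data.

```python
# Swap every '1' with '8' and every '8' with '1' in a string.
# str.maketrans builds a one-to-one character mapping, so both
# replacements happen in a single pass without clobbering each other.
SWAP_TABLE = str.maketrans({"1": "8", "8": "1"})


def swap_1_8(text):
    """Return text with all 1s and 8s exchanged (hypothetical helper)."""
    return text.translate(SWAP_TABLE)


# Output copied from the question.
ocr_output = "0800000001456>8 00000195731208 8 01050008 023+ 08 0301226>20"

if __name__ == "__main__":
    print(swap_1_8(ocr_output))
```

Applying the swap twice returns the original string, which is a quick sanity check that only those two characters are affected.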

How could this happen? Did I make a mistake in the training process? How can I fix it?

@andrew not really. just an old, invalid bill without any personal information in the reference id. — Danilo Bargen
@DaniloBargen: If possible, can you share the training data for OCRB font? — Ravi Gupta
@RaviGupta I don't have it anymore, and the results weren't good anyways. — Danilo Bargen
Hi, sorry for reviving this, but fast forward 5 years and then another year: did that training help you get correct results? I mean, did you continue to use Tesseract? — Marko

nguyenq · Accepted Answer · 2011-09-03T16:53:25

It's likely that your box file contains incorrect values (characters) for 1 and 8 somewhere. You can verify this with the jTessBoxEditor program. If so, correct the entries, regenerate the language data file, and try again.
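As a rough complement to checking by eye in jTessBoxEditor, the box file can also be compared against the known text of the training image. The sketch below assumes the usual Tesseract box file layout of one glyph per line (`<char> <left> <bottom> <right> <top> <page>`) and that boxes appear in the same order as the characters in the expected text. The function names and the sample box lines are hypothetical.

```python
def parse_box_line(line):
    """Split one box file line into (glyph, (left, bottom, right, top)).

    Assumes the layout '<char> <left> <bottom> <right> <top> <page>'.
    """
    parts = line.split()
    glyph = parts[0]
    left, bottom, right, top = (int(p) for p in parts[1:5])
    return glyph, (left, bottom, right, top)


def find_mismatches(box_lines, expected_text):
    """Compare box file glyphs, in order, with the expected characters.

    Whitespace in expected_text is ignored, since box files carry no
    entries for spaces. Returns (index, glyph_in_box, expected_char)
    tuples for every position that differs.
    """
    expected = [c for c in expected_text if not c.isspace()]
    glyphs = [parse_box_line(l)[0] for l in box_lines if l.strip()]
    if len(glyphs) != len(expected):
        print(f"warning: {len(glyphs)} boxes vs {len(expected)} expected characters")
    return [(i, g, e) for i, (g, e) in enumerate(zip(glyphs, expected)) if g != e]


# Hypothetical box lines in which 1 and 8 were labelled the wrong way round.
sample_box = """\
0 10 20 30 60 0
8 35 20 55 60 0
1 60 20 80 60 0
""".splitlines()

if __name__ == "__main__":
    for index, got, want in find_mismatches(sample_box, "018"):
        print(f"box {index}: labelled {got!r}, expected {want!r}")
```

In practice the lines would be read from the real `.box` file and `expected_text` would be the text that was typeset into the training image. A mismatch list dominated by 1/8 pairs would support the explanation above.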
