Tesseract OCR not able to train image correctly

Question

I am facing the following issue while training Tesseract OCR. I am using Tesseract 3.02 for Windows.

I have a dataset of characters to train on. I have written a C++ program that reads each character from the dataset, crops it, resizes it to a 40x40 image, and pastes it onto a single 650x450 image (see attached image). This is repeated for all 100 images in the dataset. The C++ program also generates a box file entry for every character added. I have verified the box file and image using the box editor tools mentioned on the Tesseract wiki, and these files are correct. The merged image is saved with the .tif extension.
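For readers who want to reproduce the setup, here is a minimal sketch of how one box file entry per pasted glyph could be computed. It is not the program from this question: the `SheetLayout` struct, the `boxLine` function, and the 10-pixel gap and 5-pixel margin are all hypothetical, since the post does not say how the glyphs are spaced. What it does rely on is the Tesseract 3.x box file layout, `<glyph> <left> <bottom> <right> <top> <page>`, with y measured from the bottom edge of the image.

```cpp
#include <cassert>
#include <string>

// Hypothetical sheet layout. Only the 650x450 sheet and the 40x40 glyph
// size come from the question; gap and margin are assumed values.
struct SheetLayout {
    int sheetWidth = 650;
    int sheetHeight = 450;
    int cell = 40;    // glyph width and height in pixels
    int gap = 10;     // assumed spacing between neighbouring glyphs
    int margin = 5;   // assumed border around the grid
};

// Builds the box file line for the glyph at position `index` (0-based),
// filling the sheet row by row from the top-left corner.
std::string boxLine(const SheetLayout& s, const std::string& glyph, int index) {
    const int stride = s.cell + s.gap;
    const int perRow = (s.sheetWidth - 2 * s.margin + s.gap) / stride;
    assert(perRow > 0);

    const int row = index / perRow;
    const int col = index % perRow;

    // Pixel coordinates with y measured from the top, as image code usually does.
    const int left = s.margin + col * stride;
    const int right = left + s.cell;
    const int yTop = s.margin + row * stride;
    const int yBottom = yTop + s.cell;
    assert(yBottom <= s.sheetHeight);  // the glyph must fit on the sheet

    // Box files measure y from the bottom edge, so flip both values.
    const int bottom = s.sheetHeight - yBottom;
    const int top = s.sheetHeight - yTop;

    return glyph + " " + std::to_string(left) + " " + std::to_string(bottom) +
           " " + std::to_string(right) + " " + std::to_string(top) + " 0";
}
```

With these assumed values, 13 glyphs fit on a row, so 100 glyphs take 8 rows and stay inside the 450-pixel height. The y-flip is the step most worth double-checking in a generator like this, because a box editor will still show boxes if it is wrong, just in the wrong place.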

I am attaching the image for your reference. The issue is that when I train on the image with Tesseract, I get the following output on the console:

    F:\test>tesseract eng.normal.exp0.tif eng.normal.exp0 box.train
    Tesseract Open Source OCR Engine v3.02 with Leptonica
    APPLY_BOXES:
    Boxes read from boxfile: 100
    Found 100 good blobs.
    TRAINING ... Font name = normal
    Generated training data for 9 words

Even though there are 36 distinct words or characters in the image, Tesseract says it could generate training data for only 9 characters. It also says it found 100 good blobs. I do not know why this is happening. The box file has labels for all 100 characters in the image.

Please help!

Thanks

Ruwanka Madhushan · Accepted Answer · 2015-12-30T06:23:29

According to the training guide, the training data set should be realistic. Note that, as your own output shows, it generated training data for 9 words, not for 9 characters. It has probably identified all the characters. You can use this tool to inspect the generated .traineddata file and see which characters Tesseract has been trained on.
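Separately from inspecting the .traineddata file, it can help to confirm what the box file itself contains, so that the "100 boxes" and "36 distinct characters" figures are checked against the input rather than against the console message. The following is only a sketch under stated assumptions: `BoxStats` and `summarizeBoxes` are hypothetical names, not part of Tesseract, and the function expects the box file contents already read into a string, one `<glyph> <left> <bottom> <right> <top> <page>` entry per line.

```cpp
#include <cstddef>
#include <set>
#include <sstream>
#include <string>

// Hypothetical summary type: total entries and number of distinct labels.
struct BoxStats {
    std::size_t boxes;
    std::size_t distinctLabels;
};

// Counts well-formed box entries and the distinct glyph labels among them.
// Lines that do not start with "<label> <int> <int> <int> <int>" are skipped.
BoxStats summarizeBoxes(const std::string& boxFileText) {
    std::istringstream in(boxFileText);
    std::set<std::string> labels;
    std::size_t boxes = 0;
    std::string line;

    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string label;
        int left = 0, bottom = 0, right = 0, top = 0;
        if (fields >> label >> left >> bottom >> right >> top) {
            ++boxes;
            labels.insert(label);
        }
    }
    return {boxes, labels.size()};
}
```

If this reports 100 boxes and 36 distinct labels for your file, the input is as expected, and the "9 words" line is describing how the boxes were grouped rather than how many characters were learned.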
