2
votes

I am facing the following issue while training Tesseract OCR. I am using Tesseract 3.02 for Windows.

I have a dataset of characters to train on. I have written a C++ program that reads each character from the dataset, crops it, resizes it to a 40x40 image, and pastes it onto a single 650x450 image (see attached image). This is repeated for all 100 images in the dataset. The C++ program also generates the box file entry for every character added. I have verified the box file and image using the box editor tools mentioned on the Tesseract wiki, and these files are correct. The merged image is saved with a .tif extension.
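For reference, the box-file side of such a program can be sketched as below. This is a minimal illustration, not the actual program: the `Layout` struct, the `tilesPerRow` and `boxLine` helpers, and the 5-pixel gap and 10-pixel margin are assumptions made for the example. It relies on the Tesseract 3.x box format, where each line is `<symbol> <left> <bottom> <right> <top> <page>` and coordinates are measured from the bottom-left corner of the image.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical layout parameters; only the page and tile sizes come from the question.
struct Layout {
    int pageWidth = 650;
    int pageHeight = 450;
    int tile = 40;    // each glyph is resized to tile x tile pixels
    int gap = 5;      // assumed spacing between neighbouring tiles
    int margin = 10;  // assumed margin around the page
};

// Number of tiles that fit on one row of the merged image.
int tilesPerRow(const Layout& l) {
    return (l.pageWidth - 2 * l.margin + l.gap) / (l.tile + l.gap);
}

// Build the box-file line for the tile at position `index` (row-major order).
std::string boxLine(const Layout& l, const std::string& symbol,
                    int index, int page = 0) {
    const int cols = tilesPerRow(l);
    const int row = index / cols;
    const int col = index % cols;

    const int left = l.margin + col * (l.tile + l.gap);
    const int yFromTop = l.margin + row * (l.tile + l.gap);  // top-left origin
    const int right = left + l.tile;

    // Box files use a bottom-left origin, so flip the y axis.
    const int top = l.pageHeight - yFromTop;
    const int bottom = top - l.tile;

    // The tile must lie entirely inside the page.
    assert(bottom >= 0 && right <= l.pageWidth);

    std::ostringstream out;
    out << symbol << ' ' << left << ' ' << bottom << ' '
        << right << ' ' << top << ' ' << page;
    return out.str();
}
```

With these assumed values, `boxLine(Layout{}, "A", 0)` returns `A 10 400 50 440 0`. The y-axis flip is the step most worth double-checking, since image libraries typically use a top-left origin.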

I am attaching the image for your reference. The issue is that when I train Tesseract on the image, I get the following output on the console.

F:\test>tesseract eng.normal.exp0.tif eng.normal.exp0 box.train
Tesseract Open Source OCR Engine v3.02 with Leptonica
APPLY_BOXES:
Boxes read from boxfile: 100
Found 100 good blobs.
TRAINING ... Font name = normal
Generated training data for 9 words

Even though there are 36 distinct words or characters in the image, Tesseract says it could generate training data for only 9 characters. It also says it found 100 good blobs. I do not know why this is happening. The box file has labels for all 100 characters in the image.

Please help!

training image

Thanks


2 Answers

0
votes

According to the training guide, the training dataset should be realistic. Note that, as your output shows, it generated training data for 9 words, not 9 characters, so it may well have identified all the characters. You can use this tool to inspect the generated .traineddata file and analyze which characters Tesseract has been trained on.

0
votes

Per the Training Wiki: "DO NOT MIX FONTS IN AN IMAGE FILE (In a single .tr file to be precise.) This will cause features to be dropped at clustering, which leads to recognition errors."