I am facing the following issue while training Tesseract OCR. I am using Tesseract 3.02 for Windows.
I have a dataset of character images to train on. I have written a C++ program that reads each character from the dataset, crops it, resizes it to a 40x40 image, and pastes it onto a single 650x450 image (see attached image). This is repeated for all 100 images in the dataset. The C++ program also generates a box file entry for every character it adds. I have verified the box file against the image using the box editor tools mentioned on the Tesseract wiki, and both files are correct. The merged image is saved as a .tif file.
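To make the box-file step concrete, here is a minimal sketch of how one box-file line can be derived for a pasted character. It is not the exact program: the function name makeBoxLine and the constants are hypothetical, and it assumes each glyph's position is known as a top-left corner in image coordinates. Tesseract 3.x box files list "symbol left bottom right top page" with the origin at the bottom-left corner of the image, so the y coordinates have to be flipped.

```cpp
#include <sstream>
#include <string>

// Hypothetical layout constants taken from the description above.
const int kPageHeight = 450;  // height of the merged image in pixels
const int kGlyphSize = 40;    // each character is resized to 40x40

// Builds one box-file line for a glyph whose top-left corner sits at
// (x, yTop) in image coordinates (origin top-left, y growing downward).
// Box files use a bottom-left origin, so both y values are flipped.
std::string makeBoxLine(const std::string& label, int x, int yTop, int page = 0) {
    int left = x;
    int right = x + kGlyphSize;
    int top = kPageHeight - yTop;                    // upper edge, measured from the bottom
    int bottom = kPageHeight - (yTop + kGlyphSize);  // lower edge, measured from the bottom
    std::ostringstream out;
    out << label << ' ' << left << ' ' << bottom << ' '
        << right << ' ' << top << ' ' << page;
    return out.str();
}
```

For example, a character "A" pasted with its top-left corner at (10, 10) would give the line "A 10 400 50 440 0".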
I am attaching the image for your reference. The issue is that when I train on this image in Tesseract, I get the following output on the console:
F:\test>tesseract eng.normal.exp0.tif eng.normal.exp0 box.train
Tesseract Open Source OCR Engine v3.02 with Leptonica
APPLY_BOXES:
Boxes read from boxfile: 100
Found 100 good blobs.
TRAINING ... Font name = normal
Generated training data for 9 words
Even though there are 36 distinct characters in the image, and the box file has labels for all 100 character samples, Tesseract reports that it generated training data for only 9 words. At the same time it reports that it found 100 good blobs. I do not know why this mismatch occurs.
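One way to double-check the label counts independently of the box editor is to tally the first field of every line in the box file. The sketch below is only an illustration: the helper name countLabels is hypothetical, and it assumes every line starts with the label followed by whitespace.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Counts how often each label (the first whitespace-delimited field)
// occurs in box-file text read from the given stream.
std::map<std::string, int> countLabels(std::istream& in) {
    std::map<std::string, int> counts;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string label;
        if (fields >> label) {  // blank lines are skipped
            ++counts[label];
        }
    }
    return counts;
}
```

Passing it a std::ifstream opened on the box file (presumably eng.normal.exp0.box, next to the .tif) should give a map with 36 entries whose counts add up to 100 if the file matches the description above.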
Please help!
Thanks