Creating a training image for Tesseract OCR

Question

I'm writing a generator for training images for Tesseract OCR.

When generating a training image for a new font for Tesseract OCR, what are the best values for:

The DPI
The font size in points
Should the font be anti-aliased or not
Should the bounding boxes fit snugly around each character, or not

Luiza Utsch · Accepted Answer · 2013-05-09T22:24:52

The second question is partly answered in the Tesseract training documentation: http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Generate_Training_Images It says: "There is no need to train with multiple sizes. 10 point will do. (An exception to this is very small text. If you want to recognize text with an x-height smaller than about 15 pixels, you should either train it specifically or scale your images before trying to recognize them.)"
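To check a font size and DPI against that 15-pixel x-height guideline, here is a rough sketch in Python. It uses the standard conversion of 72 points per inch. The function names and the 0.5 x-height ratio are assumptions for illustration only; the real ratio depends on the font, so measure your own font if the result is borderline.

```python
POINTS_PER_INCH = 72  # one typographic point is 1/72 of an inch


def em_height_px(font_size_pt: float, dpi: float) -> float:
    """Height of the font's em square in pixels at the given resolution."""
    return font_size_pt * dpi / POINTS_PER_INCH


def estimated_x_height_px(
    font_size_pt: float, dpi: float, x_height_ratio: float = 0.5
) -> float:
    """Rough x-height in pixels.

    x_height_ratio is an assumed value (x-height / em height); it varies
    from font to font and is not something Tesseract defines.
    """
    return em_height_px(font_size_pt, dpi) * x_height_ratio


if __name__ == "__main__":
    for dpi in (150, 300):
        x_px = estimated_x_height_px(10, dpi)
        print(f"10 pt at {dpi} dpi: estimated x-height {x_px:.1f} px")
```

Under the assumed ratio, 10 point text has an x-height of roughly 21 pixels at 300 dpi and roughly 10 pixels at 150 dpi, so the second case would fall under the "very small text" exception quoted above.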

Questions 1 and 3: from experience, I've had success with 300 dpi images and non-anti-aliased fonts. More specifically, I used the following convert parameters on a training PDF, which produced a satisfactory image:

convert -density 300 -depth 8 [input].pdf -background white -flatten +matte -compress none -monochrome [output].tif

But then I tried to add a dotted font to Tesseract, and it only detected characters properly when I used a 150 dpi image. So I don't think there's a general solution; it depends on the kind of font you're trying to add.
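Since the best density seems to depend on the font, it may be easier to render the same PDF at several densities and compare the results. The sketch below wraps the convert command above in Python. The function names and the output file names are hypothetical, and it assumes ImageMagick's convert is on your PATH.

```python
import subprocess


def build_convert_command(input_pdf: str, output_tif: str, density: int = 300) -> list:
    """Build the argument list for the ImageMagick call shown above."""
    return [
        "convert",
        "-density", str(density),   # rendering resolution in dpi
        "-depth", "8",
        input_pdf,
        "-background", "white",
        "-flatten",
        "+matte",                   # drop the alpha channel
        "-compress", "none",
        "-monochrome",              # black and white output, no anti-aliasing
        output_tif,
    ]


def render_training_images(input_pdf: str, densities=(150, 300)) -> list:
    """Render one TIFF per density and return the output file names."""
    outputs = []
    for density in densities:
        # Hypothetical naming scheme; rename to whatever your training setup expects.
        output_tif = f"training_{density}dpi.tif"
        subprocess.run(
            build_convert_command(input_pdf, output_tif, density), check=True
        )
        outputs.append(output_tif)
    return outputs
```

Calling render_training_images("training.pdf") would produce one image per density, which you can then run through training and compare.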

