I'm working on getting the Lincoln font to work in Tesseract, and I'm getting abysmal results, even after going through the wildly complicated training process.
This is what the font looks like, so yeah, it's a bit tricky:
I've carefully made a training image. The training image is here (25MB!). It is 300 DPI and has representative characters nicely spaced out vertically and horizontally.
I then made a box file for the training image and verified that it's correct using a box file editor.
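In case it helps anyone reproduce the check without a GUI editor, here is a minimal sketch of the kind of sanity check I mean. It assumes the usual Tesseract box-file layout of `<char> <left> <bottom> <right> <top> <page>` per line, with the origin at the bottom-left of the image. The names `Box`, `parse_box_line` and `find_problems` are mine, not part of Tesseract, and the page size is just an example for a letter-size page at 300 DPI.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one box-file line: <char> <left> <bottom> <right> <top> <page>."""
    parts = line.split()
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}: {line!r}")
    left, bottom, right, top, page = (int(p) for p in parts[1:])
    return Box(parts[0], left, bottom, right, top, page)


def find_problems(boxes: List[Box], width: int, height: int) -> List[str]:
    """Return a message for every box that is empty, inverted or off the page."""
    problems = []
    for number, box in enumerate(boxes, start=1):
        if not (0 <= box.left < box.right <= width):
            problems.append(f"box {number} ({box.char!r}): bad horizontal extent")
        if not (0 <= box.bottom < box.top <= height):
            problems.append(f"box {number} ({box.char!r}): bad vertical extent")
    return problems


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from my real box file.
    sample = ["U 100 200 150 260 0", "n 300 200 250 260 0"]
    boxes = [parse_box_line(line) for line in sample]
    for message in find_problems(boxes, width=2550, height=3300):
        print(message)
```

This only catches structural mistakes (inverted or out-of-bounds boxes). It cannot tell whether a box is labeled with the right character, which is what I used the box file editor for.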
I used this box file and TIFF image pair to create training data, and did likewise with the 30 or so other sample images/fonts provided by Tesseract.
I created the unicharset file.
I created a font_properties file. There's no guidance on the site about when the fraktur flag should be set, so I've tried it both this way (fraktur on for Lincoln):
eng.lincoln.box 0 0 0 0 1
And this way (fraktur off):
eng.lincoln.box 0 0 0 0 0
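To make the two variants easier to compare, here is a small sketch of how I read the line format. As far as I understand the Tesseract 3 training documentation, each line is `<fontname> <italic> <bold> <fixed> <serif> <fraktur>`, with each flag being 0 or 1. The helper `font_properties_line` is hypothetical (my own name, not a Tesseract tool), and the first field below is simply the one I used in my own lines.

```python
def font_properties_line(font_name: str, italic: bool = False, bold: bool = False,
                         fixed: bool = False, serif: bool = False,
                         fraktur: bool = False) -> str:
    """Format one font_properties line: <fontname> <italic> <bold> <fixed> <serif> <fraktur>."""
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([font_name] + [str(int(flag)) for flag in flags])


if __name__ == "__main__":
    # The two variants I tried, differing only in the last (fraktur) flag.
    print(font_properties_line("eng.lincoln.box", fraktur=True))
    print(font_properties_line("eng.lincoln.box", fraktur=False))
```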
Finally, I've tried training with and without dictionary files. The dictionary files I used came from the wordmap of my search engine, Sphinx, and contain about 15K common words and about 20K uncommon ones.
In all cases, when I try to OCR the first couple of lines of this file (3MB), the quality is abysmal. Rather than getting:
United States Court of Appeals
for the Federal Circuit
I get:
OniteiJ %tates C0urt of QppeaIs
for the jfeI1eraICircuit
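To put a number on "abysmal" and compare training runs, a character error rate based on edit distance is one option. This is just a sketch in plain Python: `levenshtein` and `char_error_rate` are my own helper names, and the two strings are the expected and actual output quoted above.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete char_a
                            curr[j - 1] + 1,      # insert char_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(expected: str, actual: str) -> float:
    """Edit distance divided by the length of the expected text."""
    if not expected:
        raise ValueError("expected text must not be empty")
    return levenshtein(expected, actual) / len(expected)


if __name__ == "__main__":
    expected = "United States Court of Appeals\nfor the Federal Circuit"
    actual = "OniteiJ %tates C0urt of QppeaIs\nfor the jfeI1eraICircuit"
    print(f"character error rate: {char_error_rate(expected, actual):.1%}")
```

Running the same measurement after each change (fraktur on or off, with or without dictionaries) would at least show whether any variant moves the result.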
Why?