Adding Blackletter Font Support to Tesseract OCR Engine

Question

I'm working on getting the Lincoln font to work in Tesseract, and I'm getting abysmal results, even after going through the wildly complicated training process.

This is what the font looks like, so yeah, it's a bit tricky:

Lincoln sample

I've carefully made a training image, and then used that to make a box file. The training image is here (25MB!). The image is 300 DPI, and has representative characters nicely spaced out vertically and horizontally.

I made a box file for the training image, and it worked properly. I've verified that it's correct using a box file editor.

I took this box file/tif file, and used it to create training data. I did likewise with the 30 or so other sample images/fonts provided by Tesseract.

I created the unicharset file.

I created a font_properties file. There's no guidance on the site about when fraktur should be used. So I've tried it both this way (fraktur on for Lincoln):

eng.lincoln.box 0 0 0 0 1

And this way (fraktur off):

eng.lincoln.box 0 0 0 0 0
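
For reference, here is a minimal sketch of how such lines can be generated. It assumes the field order described in the Tesseract 3 training documentation: a font name with no spaces, followed by five 0/1 flags for italic, bold, fixed, serif and fraktur. The helper name `font_properties_line` is hypothetical, and the font name is passed through exactly as given.

```python
def font_properties_line(font_name, italic=False, bold=False,
                         fixed=False, serif=False, fraktur=False):
    """Build one font_properties line: <name> <italic> <bold> <fixed> <serif> <fraktur>."""
    if not font_name or any(ch.isspace() for ch in font_name):
        raise ValueError("font name must be non-empty and contain no whitespace")
    flags = (italic, bold, fixed, serif, fraktur)
    # Each flag is written as 0 or 1, separated by single spaces.
    return " ".join([font_name] + [str(int(bool(flag))) for flag in flags])


if __name__ == "__main__":
    # The two variants tried above: fraktur on, then fraktur off.
    print(font_properties_line("eng.lincoln.box", fraktur=True))
    print(font_properties_line("eng.lincoln.box", fraktur=False))
```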

And finally, I've tried this with and without dictionary files. When I used dictionary files, they were the wordmap from my search engine, Sphinx, and they have about 15K common words and about 20K uncommon ones.

In all cases, when I try to OCR the first couple lines of this file (3MB), the quality is abysmal. Rather than getting:

United States Court of Appeals 
for the Federal Circuit

I get:

OniteiJ %tates C0urt of QppeaIs
for the jfeI1eraICircuit

Why?

Andrew Cash · Accepted Answer · 2012-01-29T14:01:22

I am not a Tesseract expert but I have evaluated nearly every OCR engine available and my comments are based on my experience over the years of analysing OCR errors.

Just wondering why your image has speckles in the background and not a pure white background. I don't know how Tesseract or the training tool works but the background could be causing some problems.
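
To illustrate what removing speckles means, here is a rough pure-Python sketch that clears isolated dark pixels from an already binarised image. It is only an assumption that this kind of cleanup would help Tesseract here; the function name `despeckle`, the list-of-lists image representation (1 = ink, 0 = background) and the neighbour threshold are all made up for the example, and a real scan would normally be cleaned with an image library or editor instead.

```python
def despeckle(pixels, min_neighbours=1):
    """Return a copy of a binary image with isolated ink pixels cleared.

    `pixels` is a list of rows; each value is 1 (ink) or 0 (background).
    An ink pixel survives only if at least `min_neighbours` of its
    8 surrounding pixels are also ink.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    cleaned = [row[:] for row in pixels]
    for y in range(height):
        for x in range(width):
            if pixels[y][x] != 1:
                continue
            neighbours = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    # Pixels outside the image count as background.
                    if 0 <= ny < height and 0 <= nx < width:
                        neighbours += pixels[ny][nx]
            if neighbours < min_neighbours:
                cleaned[y][x] = 0
    return cleaned


if __name__ == "__main__":
    page = [
        [0, 0, 0, 0, 0],
        [0, 1, 0, 0, 0],
        [0, 1, 0, 0, 1],  # the lone 1 on the right is a speck
        [0, 1, 0, 0, 0],
        [0, 0, 0, 0, 0],
    ]
    for row in despeckle(page):
        print(row)
```

The vertical stroke on the left is kept because each of its pixels touches another ink pixel, while the lone pixel on the right is cleared.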

Just reading the sample page is difficult and requires a large amount of concentration. Characters such as F and I are very similar, as are U and N. Tesseract, like many OCR engines, presumably uses several different techniques to recognise a character, and there is not a whole lot of difference between many of these characters in terms of the strokes and curves used in the font.

These characters, especially the uppercase ones, would confuse many different matching algorithms simply because they are so different from standard Latin / Roman type characters. This shows in your results, i.e. every capital letter except C has an OCR error.
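
A quick way to see exactly which letters were misread is to diff the expected text against the OCR output. The sketch below uses Python's standard `difflib` for this; the function names are hypothetical, and the two lines from the question are assumed to be joined with a single space.

```python
from difflib import SequenceMatcher

EXPECTED = "United States Court of Appeals for the Federal Circuit"
OCR_OUTPUT = "OniteiJ %tates C0urt of QppeaIs for the jfeI1eraICircuit"


def misread_spans(expected, actual):
    """Return (expected_text, ocr_text) pairs for every span that differs."""
    matcher = SequenceMatcher(None, expected, actual, autojunk=False)
    return [
        (expected[i1:i2], actual[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def misread_capitals(expected, actual):
    """List the capital letters of `expected` that fall inside a differing span."""
    return [
        char
        for expected_text, _ in misread_spans(expected, actual)
        for char in expected_text
        if char.isupper()
    ]


if __name__ == "__main__":
    for expected_text, ocr_text in misread_spans(EXPECTED, OCR_OUTPUT):
        print(repr(expected_text), "->", repr(ocr_text))
    print("Misread capitals:", misread_capitals(EXPECTED, OCR_OUTPUT))
```

On the two strings from the question this reports U, S, A and F as misread, while both occurrences of C come through intact.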
