Adding New Fonts to Tesseract 3

Question

I'm trying to add new fonts to Tesseract OCR. I'm following this tutorial, but I'm having some problems.

Here's what I've done so far:

Create training document

convert eng.myfont.exp0.pdf eng.myfont.exp0.tif

Train Tesseract

tesseract eng.myfont.exp0.tif eng.myfont.exp0 batch.nochop makebox

This created my eng.myfont.exp0.box file.
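A generated box file can also be sanity-checked from the command line. The following is a minimal sketch, assuming the Tesseract 3 box format of one box per line, written as "symbol left bottom right top page". The function names count_boxes and list_symbols are hypothetical helpers for illustration, not part of Tesseract.

```shell
#!/bin/sh
# Assumption: each non-empty line of the box file has the form
#   <symbol> <left> <bottom> <right> <top> <page>

# Print the number of boxes (one per non-empty line).
count_boxes() {
    awk 'NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Print each distinct symbol followed by how often it occurs,
# sorted bytewise so the order does not depend on the locale.
list_symbols() {
    awk 'NF > 0 { seen[$1]++ } END { for (s in seen) print s, seen[s] }' "$1" | LC_ALL=C sort
}
```

For example, count_boxes eng.myfont.exp0.box prints the total number of boxes, and list_symbols eng.myfont.exp0.box shows which characters the file covers.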

I open the file with moshpytt and make sure the characters were detected correctly.

Feed the box file back into Tesseract

tesseract eng.myfont.exp0.tif eng.myfont.exp0.box nobatch box.train.stderr

I have this result:

Tesseract Open Source OCR Engine v3.03 with Leptonica
APPLY_BOXES:
Boxes read from boxfile: 146
Found 146 good blobs.
TRAINING ... Font name = myfont.exp0
Generated training data for 6 words
This generated the files eng.myfont.exp0.box.tr and eng.myfont.exp0.box.txt.

Detect the character set used in the box file (this is where I get stuck)

unicharset_extractor *.box

Result:

unicharset_extractor: command not found

I also tried unicharset_extractor eng.myfont.exp0.box, with the same result.

I'm using:

tesseract 3.03
leptonica-1.70
libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0
Ubuntu 14.04.1 LTS

That's pretty peculiar. It just means the command cannot be found. On my system I'm able to find that command without any issue in /usr/local/bin/unicharset_extractor. — mlissner

nguyenq · Accepted Answer · 2014-10-26T19:19:24

The training tools for Tesseract 3.03 RC were omitted from Ubuntu 14.04, which is why unicharset_extractor cannot be found. Either fall back to Tesseract 3.02 or upgrade to Ubuntu 14.10, which should include them.
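To see which training tools are actually missing on a given system, a check along these lines may help. It is a minimal sketch: missing_tools is a hypothetical helper, and the tool list is an assumption based on the programs used in the Tesseract 3 training process, so adjust it to the steps you plan to run.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every given command
# that cannot be found on the current PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed list of Tesseract 3 training tools to look for.
TRAINING_TOOLS="unicharset_extractor mftraining cntraining combine_tessdata"

# Word splitting of the list is intentional here.
missing_tools $TRAINING_TOOLS
```

If the script prints nothing, all listed tools were found. Any name it prints is not installed or not on PATH, which matches the "command not found" error from the question.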
