Next step in image preprocessing for OCR with Tesseract (tess4j)

Question

I've been trying to use Tesseract to identify some digits in a series of images and after scouring for advice I've made a number of improvements. So far I've attempted the following steps:

Binarize the image at an appropriate threshold to pick out the numbers
Restrict Tesseract to digits only
Upscale the image using a variety of approaches (getScaledInstance with Image.SCALE_SMOOTH, AffineTransform using AffineTransformOp.TYPE_BICUBIC)
Explore different Tesseract page segmentation modes. Currently using mode 6.
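For reference, here is a minimal sketch of the binarize and upscale steps above, using only the JDK. The class name `DigitPreprocessor`, the luma weights, and the threshold passed in are illustrative assumptions, not the exact code used on these images. The resulting `BufferedImage` would then be handed to tess4j.

```java
import java.awt.geom.AffineTransform;
import java.awt.image.AffineTransformOp;
import java.awt.image.BufferedImage;

// Hypothetical helper; names and defaults are illustrative only.
class DigitPreprocessor {

    /** Returns a 1-bit copy: pixels darker than the threshold become black, the rest white. */
    static BufferedImage binarize(BufferedImage src, int threshold) {
        BufferedImage out = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of perceived brightness (assumed weights).
                int luma = (299 * r + 587 * g + 114 * b) / 1000;
                out.setRGB(x, y, luma < threshold ? 0xFF000000 : 0xFFFFFFFF);
            }
        }
        return out;
    }

    /** Scales the image by the given factor with bicubic interpolation. */
    static BufferedImage upscaleBicubic(BufferedImage src, double factor) {
        AffineTransform scale = AffineTransform.getScaleInstance(factor, factor);
        AffineTransformOp op = new AffineTransformOp(scale, AffineTransformOp.TYPE_BICUBIC);
        // Passing null lets the op allocate a destination sized to the scaled bounds.
        return op.filter(src, null);
    }
}
```

Whether to upscale before or after binarizing is a judgment call; this sketch leaves the order to the caller.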

The numbers are all identical in shape and perfectly aligned, though their edges are somewhat jagged. Example processed images:

[image: example processed digit images]

Tesseract does okay with these, but it often confuses 8 for 3, 6 for 5, and 9 for 5.

I've been looking a little at different ways to smooth the image and trying different scales, but I'm also wondering if it makes more sense to just go through the process of training Tesseract. With only 10 possible values that are always almost identical, it seems like it shouldn't be too hard for it to learn to recognize them, but training Tesseract also seems like a huge pain.
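One simple way to smooth jagged edges on an already-binarized image is a 3x3 majority filter, which is a median filter on black-and-white pixels. The sketch below is only an illustration of that idea, under the assumption that the input holds pure black and white pixels. The class name is hypothetical, and there is no guarantee it improves Tesseract's results on these digits.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper illustrating a 3x3 majority (binary median) filter.
class EdgeSmoother {

    /** Each output pixel is black if more than half of its in-bounds 3x3 neighbourhood is black. */
    static BufferedImage majorityFilter(BufferedImage src) {
        int w = src.getWidth();
        int h = src.getHeight();
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int black = 0;
                int total = 0;
                for (int dy = -1; dy <= 1; dy++) {
                    for (int dx = -1; dx <= 1; dx++) {
                        int nx = x + dx;
                        int ny = y + dy;
                        if (nx < 0 || ny < 0 || nx >= w || ny >= h) {
                            continue; // ignore neighbours outside the image
                        }
                        total++;
                        // Assumes a binarized input, so one channel is enough to classify.
                        if ((src.getRGB(nx, ny) & 0xFF) < 128) {
                            black++;
                        }
                    }
                }
                out.setRGB(x, y, black * 2 > total ? 0xFF000000 : 0xFFFFFFFF);
            }
        }
        return out;
    }
}
```

A filter like this removes isolated specks and single-pixel notches, but it can also erode thin strokes, so it is worth checking the output by eye before feeding it to OCR.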

Any suggestions on how to get the final bit of accuracy out of Tesseract on these images?

I'm using tess4j and Java, so Java-specific suggestions and libraries are especially appreciated. While I'm willing to implement algorithms myself, I'd hate to reinvent the wheel.

Alex Pritchard · Accepted Answer · 2015-02-26T06:12:35

I tried a few more preprocessing ideas without making much progress, including various types of greyscale conversion, color inversion, resizing, and alternate binarization strategies. None of these improved on my original, non-resized binarization. Ultimately I decided to give Tesseract training a go. I followed the instructions here: Manual Tesseract Training Walkthrough

I had a hard time finding any helper programs that actually worked on 64-bit Windows and ended up doing most of the work by hand. I used jTessBoxEditor to edit the manually generated .box files, though I also did some editing in a text editor to add entries for characters the box file generator missed. I only have these small TIFFs to work from, so my training files don't meet the Tesseract wiki guidelines, but oh well.
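When adding box entries by hand, the fiddly part is the coordinate system. As far as I understand the Tesseract 3.x box format, each line is `symbol left bottom right top page`, with the origin at the bottom-left of the image, whereas Java image coordinates start at the top-left. The sketch below converts one to the other. The class and method names are hypothetical, and the format description is my assumption rather than something stated in the tutorial above.

```java
// Hypothetical helper: builds one .box file line from a top-left-origin rectangle.
class BoxFileEntry {

    /**
     * @param symbol      the character inside the box, e.g. "8"
     * @param x           left edge in image (top-left origin) coordinates
     * @param y           top edge in image (top-left origin) coordinates
     * @param width       box width in pixels
     * @param height      box height in pixels
     * @param imageHeight full height of the image, needed to flip the y axis
     * @param page        zero-based page index within the TIFF
     */
    static String format(String symbol, int x, int y, int width, int height,
                         int imageHeight, int page) {
        int left = x;
        int right = x + width;
        // Flip the y axis: box files measure upward from the bottom edge.
        int top = imageHeight - y;
        int bottom = imageHeight - (y + height);
        return symbol + " " + left + " " + bottom + " " + right + " " + top + " " + page;
    }
}
```

For example, a 20x30 box for the digit 8 at (10, 5) in a 100-pixel-tall image gives `BoxFileEntry.format("8", 10, 5, 20, 30, 100, 0)`, which returns `8 10 65 30 95 0`.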

I got some errors when using box.train:

FAIL! apply_boxes BOXFILE LINE ... failure! COULDN'T FIND A MATCHING BLOB

After unproductive googling I decided to ignore them and press on.

I got more errors when trying to run cntraining:

Error: Illegal number of feature sets!
signal_termination_handler:Error:Signal_termination_handler called:Code 3001

After MORE unproductive googling, I basically tried omitting each of my .tr files in turn to see which one caused the problem. Eventually I was able to complete cntraining with 1 missing file. I have no idea what effect this has on my output, but again I decided to just ignore it and keep going.

I ran into another problem running combine_tessdata:

Error opening unicharset file
Error combining tessdata files into foo.traineddata

This was because I needed to put my lang prefix before the unicharset file name, which the tutorial didn't instruct me to do. After doing that, I successfully built a traineddata file. With no idea whether it would work, I dropped it into my tessdata directory, switched my language to the newly trained one, and tried again.
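The renaming step can be scripted. This is a minimal sketch that adds the lang prefix to the training outputs in a directory, so that `unicharset` becomes `foo.unicharset`. Only the `unicharset` rename is confirmed by the error above. The other file names in the list reflect my understanding of what Tesseract 3.x `combine_tessdata` looks for and should be treated as an assumption, as should the class name.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: prefixes training output files with the language code.
class TrainingFilePrefixer {

    // Assumed list of outputs that need the prefix; only "unicharset" is confirmed above.
    static final String[] TRAINING_OUTPUTS = {
        "unicharset", "inttemp", "normproto", "pffmtable", "shapetable"
    };

    /** Renames each existing output to "<lang>.<name>" and returns the new paths. */
    static List<Path> addLangPrefix(Path dir, String lang) {
        List<Path> renamed = new ArrayList<>();
        for (String name : TRAINING_OUTPUTS) {
            Path source = dir.resolve(name);
            if (!Files.exists(source)) {
                continue; // skip outputs this training run did not produce
            }
            Path target = dir.resolve(lang + "." + name);
            try {
                // Fails if the target already exists, so nothing is silently overwritten.
                Files.move(source, target);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            renamed.add(target);
        }
        return renamed;
    }
}
```

Calling `TrainingFilePrefixer.addLangPrefix(Paths.get("training"), "foo")` would rename whichever of those files exist in a hypothetical `training` directory.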

VOILA, it was perfect. It now seems to recognize my digits with 100% accuracy (at least across my limited sample size). The only preprocessing I'm doing is binarizing the images, with no further cleanup or rescaling.

So, apparently with a small charset, manual training is worth the trouble. It took me probably 3 hours to muddle through finding tools that work and kludging my way through the process. For reference, I started with 14 TIFFs similar to those in my initial post. Four of them had one error or another along the way, plus the 1 I omitted from cntraining (but not from anything else..?), so like.. 9 and 2/3 images for training. Apparently that was enough, thanks to the consistency of my characters.
