Improve Tesseract detection quality

Question

I am trying to extract alphanumeric characters (a-z0-9) that do not form meaningful words from an image taken with a consumer camera (including mobile phones). The characters all have the same size and font and are not formatted. The actual processing is done under Windows.

The following image shows the raw input: Original image

After perspective correction I apply the following steps with OpenCV:

Convert from RGB to gray
Apply cv::medianBlur to remove noise
Convert the image to binary using adaptive thresholding cv::adaptiveThreshold
I know the number of rows and columns of the grid. Thus I simply extract each grid cell using this information.
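The cell extraction in the last step can be sketched as follows. This is a minimal sketch, not the poster's actual code: it assumes the rectified image covers exactly the grid, and the names CellRect, gridCells and the inset parameter are hypothetical. The inset shrinks every cell by a few pixels so that grid lines along the cell borders are less likely to end up in the crop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Axis-aligned rectangle in pixel coordinates (same fields as cv::Rect).
struct CellRect {
    int x, y, width, height;
};

// Split a width x height image into rows x cols cells, row by row.
// Each cell is shrunk by `inset` pixels on every side; `inset` is a
// hypothetical tuning parameter, not something taken from the post.
std::vector<CellRect> gridCells(int width, int height, int rows, int cols, int inset) {
    assert(width > 0 && height > 0 && rows > 0 && cols > 0 && inset >= 0);
    std::vector<CellRect> cells;
    cells.reserve(static_cast<std::size_t>(rows) * cols);
    for (int r = 0; r < rows; ++r) {
        // Boundaries are derived from the full extent so rounding does not accumulate.
        const int y0 = r * height / rows;
        const int y1 = (r + 1) * height / rows;
        for (int c = 0; c < cols; ++c) {
            const int x0 = c * width / cols;
            const int x1 = (c + 1) * width / cols;
            const int w = x1 - x0 - 2 * inset;
            const int h = y1 - y0 - 2 * inset;
            if (w <= 0 || h <= 0) continue;  // inset too large for this cell, skip it
            cells.push_back({x0 + inset, y0 + inset, w, h});
        }
    }
    return cells;
}
```

With OpenCV, each rectangle can then be used as a region of interest, for example image(cv::Rect(cell.x, cell.y, cell.width, cell.height)).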

After all these steps I get images which look similar to these:

[Image: examples of extracted cell images]

Then I run tesseract (latest SVN version with latest training data) on each extracted cell image individually (I tried different -psm and -l values):

tesseract.exe -l eng -psm 11 sample.png outtext

The results produced by tesseract are not very good:

Most characters are not recognized.
The grid lines are sometimes interpreted as "l" or "i" characters.

I already experimented with morphological operations (open, close, erode, dilate) and replaced adaptive thresholding with Otsu thresholding (THRESH_OTSU), but the results got worse.

What else could I try to improve the recognition quality? Or is there a better method than tesseract for extracting the characters (template matching, for instance)?

Edit (21-12-2014): I tested simple template matching (using normalized cross correlation and LMS), but with even worse results. However, I have made a huge step forward by extracting each character using findContours and then running tesseract on only one character with the -psm 10 option, which interprets each input image as a single character. Additionally, I remove non-alphanumeric characters in a post-processing step. The first results are encouraging, with detection rates of 90% and better. The main remaining problem is confusion between the "9", "g" and "q" characters.
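The post-processing step from the edit could look roughly like this. It is a sketch under the assumption that the tesseract output for one character has already been read into a std::string; the function name keepAlphanumeric is hypothetical.

```cpp
#include <cctype>
#include <string>

// Keep only letters and digits from the raw OCR text of one character image.
// Whitespace, newlines and stray symbols (for example '|' from a grid line)
// are dropped. In the default "C" locale only ASCII letters and digits pass.
std::string keepAlphanumeric(const std::string& raw) {
    std::string out;
    for (unsigned char ch : raw) {  // unsigned char avoids undefined behavior in isalnum
        if (std::isalnum(ch)) {
            out.push_back(static_cast<char>(ch));
        }
    }
    return out;
}
```

An empty result means tesseract returned nothing usable for that cell, which is worth treating as a separate "not recognized" case rather than as a character.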

Regards,

Alto Alto · Accepted Answer · 2014-12-22T09:57:58

As I said here, you can tell tesseract to pay attention to characters that look almost the same. Also, some tesseract options do not help in your case. For instance, "Pocahonta5S" will, most of the time, become "PocahontaSS" because the digit sits inside a word made of letters. That is something you can look into as well.

Concerning pre-processing, you should use a sharpening filter. Don't forget that tesseract always applies Otsu thresholding itself before reading anything. If you want good results, sharpening plus adaptive thresholding, combined with some other filters, is a good idea.
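To illustrate the sharpening step, here is a minimal sketch on a plain grayscale buffer. The answer does not say which sharpening filter to use, so the 3x3 kernel below (5 in the center, -1 for the four direct neighbors) is an assumption, as is the function name sharpen3x3. In OpenCV, cv::filter2D with the same kernel does roughly the same thing, although its default border handling differs from the edge replication used here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sharpen an 8-bit grayscale image stored row by row.
// Assumed kernel:  0 -1  0
//                 -1  5 -1
//                  0 -1  0
// Pixels outside the image are replaced by the nearest edge pixel.
std::vector<std::uint8_t> sharpen3x3(const std::vector<std::uint8_t>& gray,
                                     int width, int height) {
    assert(width > 0 && height > 0);
    assert(gray.size() == static_cast<std::size_t>(width) * height);

    // Read a pixel with coordinates clamped to the image bounds.
    auto at = [&](int x, int y) {
        x = std::max(0, std::min(x, width - 1));
        y = std::max(0, std::min(y, height - 1));
        return static_cast<int>(gray[static_cast<std::size_t>(y) * width + x]);
    };

    std::vector<std::uint8_t> out(gray.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const int v = 5 * at(x, y)
                        - at(x - 1, y) - at(x + 1, y)
                        - at(x, y - 1) - at(x, y + 1);
            // Saturate to the valid 8-bit range.
            out[static_cast<std::size_t>(y) * width + x] =
                static_cast<std::uint8_t>(std::max(0, std::min(v, 255)));
        }
    }
    return out;
}
```

Flat regions are left unchanged by this kernel, while differences between a pixel and its neighbors are amplified, which is what makes stroke edges crisper before thresholding. It also amplifies noise, so it is usually applied after the blur step, not before.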
