67
votes

I want to use tesseract to recognize only numbers. The problem is that I have mixture of numbers & letters and when I use SetVariable("tessedit_char_whitelist", "0123456789")
for every symbol tesseract returns wrong digit.

Can I set a threshold value so that tesseract omits the symbols with low resemblance?

NOTE: I set tesseract to recognize only digits so there is no confusion between O and 0.

10
Hi there, I am also using Tesseract with Java project and I am facing some issues, I have business cards images and I need to extract email addresses, the problem is that sometimes it makes confusion between numbers and letters, the email "[email protected]" becomes "[email protected]", would you have and idea how to fix this?Francisco Souza

10 Answers

44
votes

Recognizing only numbers is actually answered on the tesseract FAQ page. See that page for more info, but if you have the version 3 package, the config files are already set up. You just specify on the commandline:

tesseract image.tif outputbase nobatch digits

As for the threshold value, I'm not sure which you mean. If your input is an unusual font, perhaps you might retrain with a sample of your input. An alternative is to change tesseract's pruning threshold. Both options are also mentioned in the FAQ.

15
votes

For tesseract 3, the command is simpler tesseract imagename outputbase digits according to the FAQ. But it doesn't work for me very well.

I turn to try different psm options and find -psm 6 works best for my case.

man tesseract for details.

13
votes

For tesseract 3, i try to create config file according FAQ.

BEFORE calling an Init function or put this in a text file called tessdata/configs/digits:

tessedit_char_whitelist 0123456789                 

then, it works by using the command: tesseract imagename outputbase digits

12
votes

If one want to match 0-9

tesseract myimage.png stdout -c tessedit_char_whitelist=0123456789

Or if one almost wants to match 0-9, but with one or more different characters

tesseract myimage.png stdout -c tessedit_char_whitelist=01234ABCDE
9
votes

I made it a bit different (with tess-two). Maybe it will be useful for somebody.

So you need to initialize first the API.

TessBaseAPI baseApi = new TessBaseAPI();
baseApi.init(datapath, language, ocrEngineMode);

Then set the following variables

baseApi.setPageSegMode(TessBaseAPI.PageSegMode.PSM_SINGLE_LINE);
baseApi.setVariable(TessBaseAPI.VAR_CHAR_BLACKLIST, "!?@#$%&*()<>_-+=/:;'\"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz");
baseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, ".,0123456789");
baseApi.setVariable("classify_bln_numeric_mode", "1");

In this way the engine will check only the numbers.

3
votes

You can instruct tesseract to use only digits, and if that is not accurate enough then best chance of getting better results is to go trough training process: http://www.resolveradiologic.com/blog/2013/01/15/training-tesseract/

3
votes

This feature is not supported in version 4. You can still use it via -c tessedit_char_whitelist=0123456789 with "--oem 0" which reverts to the old model.

There is a bounty to fix this issue.

Possible workarounds:

As stated by @amitdo

3
votes

add "--psm 7 -c tessedit_char_whitelist=0123456789'" works for me when the image contain's only 1 line.

1
votes
custom_oem=r'digits --oem 1 --psm 7 -c tessedit_char_whitelist=0123456789'

text = tess.image_to_string(croped,config=custom_oem)

I am using tesseract 4.1.1.

For better result you might want to consider Image processing techniques.

-2
votes

What I do is to recognize everything, and when I have the text, I take out all the characters except numbers

//This replaces all except numbers from 0 to 9
recognizedText = recognizedText.replaceAll("[^0-9]+", " ");

This works pretty well for me.