I'm having trouble with pytesseract. I know that you can restrict tesseract to a specific set of characters using command-line arguments:

tesseract input.tif output nobatch digits

I found some people saying they can restrict tesseract with the following lines in Python:

import tesseract
ocr = tesseract.TessBaseAPI()
ocr.Init(".","eng",tesseract.OEM_TESSERACT_ONLY)
ocr.SetVariable("tessedit_char_whitelist", "0123456789")

But this is for the tesseract API, and I'm using pytesseract. Finally, I also tried:

print(image_to_string(someimage, config='outputbase digits'))

But this doesn't work: I still get letters in my output. This is odd, because the code below does work:

print(image_to_string(screen, config='-psm 10'))

PSM stands for page segmentation mode; mode 10 lets me parse my image file as a single character. I don't understand why this works and the previous snippet doesn't, when both are command-line arguments to tesseract.

Can anyone help? I want to use both options with a custom wordlist (that I created in the config folder of tesseract).

1 Answer

I finally found the solution, in case it helps anyone. This is from the tesseract help page:

Simplest invocation of tesseract:

tesseract imagename outputbase

I could deduce the proper syntax from that (in fact, nearly everything I found on Stack Overflow pointed me in the wrong direction, maybe because of different versions of tesseract). Keep in mind I'm using tesseract 3.05 (Windows installer available on GitHub) and pytesseract (installed from pip).

image_to_string(someimage, config='digits -psm 7')

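As the help page shows, the outputbase argument comes right after the image filename and before the other options. pytesseract already supplies both the image filename and the outputbase itself, so the config string should contain only what follows them: the config file name (digits) and then the options. Writing the word "outputbase" literally in config, as in the question, does not work. With the order above, PSM and a restricted character set can be used together.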
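To make the ordering concrete, here is a rough sketch of how the final command line might be assembled. The helper build_command and the file names are hypothetical and only for illustration; this is not pytesseract's actual source, just an assumption that the config string is split and appended after the image name and outputbase.

```python
import shlex

def build_command(image_path, outputbase, config=""):
    # Hypothetical helper (not pytesseract's real code): the wrapper provides
    # the two positional arguments, then appends whatever is in config.
    return ["tesseract", image_path, outputbase] + shlex.split(config)

# The attempt from the question: "outputbase" ends up as an extra literal argument.
print(build_command("in.png", "out", "outputbase digits"))

# The working form: config file name first, then the options.
print(build_command("in.png", "out", "digits -psm 7"))
```

Under that assumption, the first call passes a stray argument named outputbase to tesseract, while the second matches the shape of the invocation shown on the help page.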

All the command-line arguments from the tesseract help page can be passed this way, through the config parameter.
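For anyone who wants a reusable version, here is a minimal sketch, assuming pytesseract and Pillow are installed and tesseract is on the PATH. The function names make_config and read_digits are my own, not part of pytesseract. One caveat: tesseract 3.05 takes -psm, while tesseract 4.x spells the flag --psm, so the flag is left as a parameter.

```python
def make_config(configfile="digits", psm=7, psm_flag="-psm"):
    """Build the string passed as config= (config file first, then options).

    psm_flag is "-psm" for tesseract 3.05; use "--psm" for tesseract 4.x.
    """
    return "{} {} {}".format(configfile, psm_flag, psm)

def read_digits(image_path, psm=7, psm_flag="-psm"):
    # Hypothetical convenience wrapper. Imports are local so make_config
    # can be used even where the OCR stack is not installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(
            img, config=make_config(psm=psm, psm_flag=psm_flag)
        )

print(make_config())  # digits -psm 7
```

Calling read_digits("someimage.png") would then be equivalent to the image_to_string line above, with someimage.png standing in for your own file.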