python-tesseract OCR: get digits only

Question

I'm using tesseract OCR with python-tesseract. The tesseract FAQ says this about recognizing digits only:

Use

TessBaseAPI::SetVariable("tessedit_char_whitelist", "0123456789");

BEFORE calling an Init function or put this in a text file called tessdata/configs/digits:

tessedit_char_whitelist 0123456789

and then your command line becomes:

tesseract image.tif outputbase nobatch digits

Warning: Until the old and new config variables get merged, you must have the nobatch parameter too.

The SetVariable method exists in python-tesseract. I tried the following, but the OCR result is unchanged:

api = tesseract.TessBaseAPI()
api.SetVariable("tessedit_char_whitelist", "0123456789")
api.Init('.','eng',tesseract.OEM_DEFAULT)
api.SetPageSegMode(tesseract.PSM_AUTO)

Has anyone already got this working, or should I consider it a bug in python-tesseract?

jpimentel · Accepted Answer · 2012-03-21T13:22:09

OK, got it working. According to this (unofficial?) documentation of tesseract-ocr, SetVariable() must be called after Init(), even though the official FAQ says the opposite. Calling it after Init() works as intended.
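
For reference, here is a minimal sketch of the reordered calls. The helper name init_digits_only and its parameters are my own, purely for illustration; it only uses the Init, SetVariable and SetPageSegMode methods already shown in the question, and takes the API object as an argument so the ordering is easy to see.

```python
def init_digits_only(api, datapath, lang, oem, psm):
    """Initialise a TessBaseAPI-like object, then restrict it to digits.

    Hypothetical helper: `api` is assumed to expose Init, SetVariable and
    SetPageSegMode as in the question's python-tesseract snippet.
    """
    # Call Init() first. As found above, a whitelist set before Init()
    # did not change the OCR result.
    api.Init(datapath, lang, oem)
    # Only now restrict recognition to the digits 0-9.
    api.SetVariable("tessedit_char_whitelist", "0123456789")
    api.SetPageSegMode(psm)
    return api
```

With python-tesseract this would be called as init_digits_only(tesseract.TessBaseAPI(), '.', 'eng', tesseract.OEM_DEFAULT, tesseract.PSM_AUTO), i.e. the same arguments as in the question, just with SetVariable moved after Init.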
