3
votes

I have an application where technical datasheets are OCR'd using the tesseract API. I initialize it like this:

tesseract::TessBaseAPI tess;
tess.Init(NULL, "eng", tesseract::OEM_TESSERACT_ONLY);

However, even after using custom whitelists like this

tess.SetVariable("tessedit_char_blacklist", "");
tess.SetVariable("tessedit_char_whitelist", myWhitelist);

some datasheet entries are recognized wrongly, for example PA3 is recognized as FAB.

How can I disable dictionary-assisted OCR? To avoid affecting other tools, I'd prefer not to modify global config files if possible.

Note: This is not a duplicate of this previous question because said question explicitly asks for the command-line tool while I explicitly ask for the tesseract API.


3 Answers

6
votes

You can simply set the penalties to zero:

tess.SetVariable("segment_penalty_garbage", "0");
tess.SetVariable("segment_penalty_dict_nonword", "0");
tess.SetVariable("segment_penalty_dict_frequent_word", "0");
tess.SetVariable("segment_penalty_dict_case_ok", "0");
tess.SetVariable("segment_penalty_dict_case_bad", "0");

Although the dictionary stays active, this approach tells the algorithm that a dictionary hit (which also covers things like bad punctuation) is no better than a non-dictionary hit.

See the dict.cpp source code for reference.
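If you want to know whether every penalty was actually accepted, the five calls can be wrapped in a small helper. This is only a sketch: `applySettings` and `zeroDictPenalties` are names made up for illustration, not part of tesseract. It relies only on `SetVariable(const char*, const char*)` returning `false` for a name it does not recognize, and it is templated on the API type so it can be tried without linking against tesseract.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// A name/value pair in the form SetVariable(const char*, const char*) expects.
using Setting = std::pair<const char*, const char*>;

// Hypothetical helper: applies each setting via api.SetVariable and returns
// the names of the settings that were rejected (SetVariable returned false).
// Api can be tesseract::TessBaseAPI or any type with a compatible SetVariable.
template <typename Api>
std::vector<std::string> applySettings(Api& api,
                                       const std::vector<Setting>& settings) {
    std::vector<std::string> rejected;
    for (const Setting& s : settings) {
        assert(s.first != nullptr && s.second != nullptr);
        if (!api.SetVariable(s.first, s.second)) {
            rejected.emplace_back(s.first);
        }
    }
    return rejected;
}

// The five penalty variables listed above, each set to zero.
inline std::vector<Setting> zeroDictPenalties() {
    return {
        {"segment_penalty_garbage", "0"},
        {"segment_penalty_dict_nonword", "0"},
        {"segment_penalty_dict_frequent_word", "0"},
        {"segment_penalty_dict_case_ok", "0"},
        {"segment_penalty_dict_case_bad", "0"},
    };
}
```

With the `tess` object from the question, usage would be `auto rejected = applySettings(tess, zeroDictPenalties());`. A non-empty result suggests that your tesseract version does not know one of the variable names.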

5
votes

You can do it the following way:

tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
if (api->Init(NULL, "eng"))
{
    fprintf(stderr, "Could not initialize tesseract.\n");
    exit(1);
}

if (!api->SetVariable("tessedit_enable_doc_dict", "0"))
{
    fprintf(stderr, "Unable to disable the document dictionary.\n");
}

Simply pass "tessedit_enable_doc_dict" to the SetVariable function along with its corresponding boolean value.

I found it in the tesseractclass.h header file (https://tesseract-ocr.github.io/a00736_source.html, line 839). I guess the best way to find the correct parameters is to look at the values defined in the header file for your version (mine is 3.04). I tried a few that I found on the internet before, but they didn't work. This was the configuration that worked for me.

1
votes