0
votes

I am working on some OCR experiments where I would like to improve the quality of Tesseract's output. The test subjects are CAPTCHA-like images: random characters on an obfuscated background. Tesseract isn't doing a very good job, partly because it sometimes identifies a single character as several separate characters or digits.

I am wondering whether telling Tesseract that my image always contains text of a fixed length, say six characters, could improve the recognition result. But I am not sure whether Tesseract supports this.

I didn't find any documentation on this. Could someone point out whether such a feature exists and, if it does, which configuration parameter I should set? Thanks!


1 Answer

2
votes

Try this example. It runs OCR on each text line that Tesseract detects and prints the recognized text with its mean confidence. The bound of the for loop controls how many detected components are processed.

Consider the following code:

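#include <cstdio>
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>

int main() {
    Pix *image = pixRead("/usr/src/tesseract-3.02/phototest.tif");
    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    if (api->Init(NULL, "eng")) return 1;  // Init returns non-zero on failure
    api->SetImage(image);
    Boxa* boxes = api->GetComponentImages(tesseract::RIL_TEXTLINE, true, NULL, NULL);  // one box per text line
    printf("Found %d textline image components.\n", boxes->n);
    for (int i = 0; i < boxes->n; i++) {
        BOX* box = boxaGetBox(boxes, i, L_CLONE);
        api->SetRectangle(box->x, box->y, box->w, box->h);  // restrict OCR to this box
        char* ocrResult = api->GetUTF8Text();
        int conf = api->MeanTextConf();
        fprintf(stdout, "Box[%d]: x=%d, y=%d, w=%d, h=%d, confidence: %d, text: %s",
                        i, box->x, box->y, box->w, box->h, conf, ocrResult);
        delete[] ocrResult;  // the caller owns the returned string
        boxDestroy(&box);
    }
    boxaDestroy(&boxes);
    api->End();
    delete api;
    pixDestroy(&image);
    return 0;
}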

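In for (int i = 0; i < boxes->n; i++), boxes->n is the number of text-line components that were detected. To process at most 20 of them, cap the bound at the smaller of boxes->n and 20; a fixed 20 would index past the end of the array whenever fewer components were found. Note that this limits the number of components processed, not the number of characters in the recognized text.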
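As far as I know, Tesseract has no parameter that fixes the length of the recognized text, so one workaround is to check the length after recognition and reject or retry any result that does not match. Below is a minimal sketch of that check. The helper normalizeToLength is hypothetical (it is not part of the Tesseract API), and it assumes the text is plain ASCII, which is typical for CAPTCHA-style images.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of Tesseract: strips all whitespace from the
// raw OCR output and accepts the result only if exactly `expected_len`
// characters remain. Assumes single-byte (ASCII) characters.
// On success, writes the cleaned text to `out` and returns true.
// On failure, leaves `out` untouched and returns false.
bool normalizeToLength(const std::string& raw, std::size_t expected_len,
                       std::string& out) {
    assert(expected_len > 0 && "expected length must be positive");
    std::string cleaned;
    for (unsigned char c : raw) {
        if (!std::isspace(c)) {
            cleaned.push_back(static_cast<char>(c));
        }
    }
    if (cleaned.size() != expected_len) {
        return false;  // wrong length: the caller can retry or discard
    }
    out = cleaned;
    return true;
}
```

You could pass the string returned by GetUTF8Text() to this helper with an expected length of six, and rerun recognition with different preprocessing or settings when it returns false.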