
I have problems with the general recognition of subscript and superscript in text fragments.

Example image: [image showing text fragments with subscript and superscript]

I used Tesseract 4.1.1 with the training data available at https://github.com/tesseract-ocr/tessdata_best. All the numerous options had their default values except:

  • tessedit_create_hocr = 1 (to get the result as hOCR)
  • hocr_font_info = 1 (to get additional font info such as the font size)
  • hocr_char_boxes = 1 (to get a character-based result)

The language was set to eng. The subscripts/superscripts were not recognized correctly with page segmentation mode 3 (PSM_AUTO_OSD), 11 (PSM_SPARSE_TEXT), or 12 (PSM_SPARSE_TEXT_OSD).

In the output, the sub/sup fragments were all more or less wrong:

  • "SubtextSub" is recognized as "Subtextsu,"
  • "SuptextSub" is recognized as "Suptexts?"
  • "P0" is recognized as "Po"
  • "P100" is recognized as "P1go"
  • "a2+b2" is recognized as "a+b?"

Using Tesseract for OCR, is there a way to ...

  1. optimize subscript/superscript handling?
  2. get information about recognized subscripts/superscripts (in the hOCR output, ideally for each character)?
To give a bit of context: superscripts and subscripts are important when it comes to chemical formulas, and superscripts are also used for footnotes. The distinction from normal text is relevant when the superscript follows a number: "Revenue in Q1 (in million USD): 54²" is very different from "Revenue in Q1 (in million USD): 542". – Martin Thoma

3 Answers

1 vote

Working on the quality of the image, as suggested in other questions/answers on this topic, didn't really change anything.

Following these two links from the Tesseract Google newsgroup, at first it really seemed to be a question of training: link1 and link2.

But after doing some experiments I found that the OEM_DEFAULT OCR engine mode just doesn't surface the needed information. I found a partial solution to the problem: partial, because I now get most of the sub/sup information and the recognized characters are right in most cases, but not for all characters.

Using the OEM_TESSERACT_ONLY OCR engine mode (i.e. the legacy engine) and some API methods provided by Tess4J, I came up with the following Java test class:

import java.awt.image.BufferedImage;
import java.io.File;
import java.nio.IntBuffer;

import com.sun.jna.Pointer;

import net.sourceforge.tess4j.ITessAPI.TessBaseAPI;
import net.sourceforge.tess4j.ITessAPI.TessOcrEngineMode;
import net.sourceforge.tess4j.ITessAPI.TessPageIterator;
import net.sourceforge.tess4j.ITessAPI.TessPageIteratorLevel;
import net.sourceforge.tess4j.ITessAPI.TessPageSegMode;
import net.sourceforge.tess4j.ITessAPI.TessResultIterator;
import net.sourceforge.tess4j.TessAPI1;
import net.sourceforge.tess4j.util.ImageIOHelper;

import static net.sourceforge.tess4j.TessAPI1.*;

public class SubSupEvaluator {
    public void determineSubSupCharacters(BufferedImage image) {
        //1. initialize Tesseract and set image infos
        TessBaseAPI handle = TessAPI1.TessBaseAPICreate();
        try {
            int bpp = image.getColorModel().getPixelSize();
            int bytespp = bpp / 8;
            int bytespl = (int) Math.ceil(image.getWidth() * bpp / 8.0);
            TessBaseAPIInit2(handle, new File("./tessdata/").getAbsolutePath(), "eng", TessOcrEngineMode.OEM_TESSERACT_ONLY);
            TessBaseAPISetPageSegMode(handle, TessPageSegMode.PSM_AUTO_OSD);
            TessBaseAPISetImage(handle, ImageIOHelper.convertImageData(image), image.getWidth(), image.getHeight(), bytespp, bytespl);

            //2. start actual OCR run
            TessBaseAPIRecognize(handle, null);

            //3. iterate over the result character-wise
            TessResultIterator ri = TessBaseAPIGetIterator(handle);
            TessPageIterator pi = TessResultIteratorGetPageIterator(ri);
            TessPageIteratorBegin(pi);
            do {
                //determine character
                Pointer ptr = TessResultIteratorGetUTF8Text(ri, TessPageIteratorLevel.RIL_SYMBOL);
                String character = ptr.getString(0);
                TessDeleteText(ptr); //release memory

                //determine position information
                IntBuffer leftB = IntBuffer.allocate(1);
                IntBuffer topB = IntBuffer.allocate(1);
                IntBuffer rightB = IntBuffer.allocate(1);
                IntBuffer bottomB = IntBuffer.allocate(1);
                TessPageIteratorBoundingBox(pi, TessPageIteratorLevel.RIL_SYMBOL, leftB, topB, rightB, bottomB);

                //write info to console
                System.out.println(String.format("%s - position [%d %d %d %d], subscript: %b, superscript: %b", character, leftB.get(), topB.get(),
                    rightB.get(), bottomB.get(), TessAPI1.TessResultIteratorSymbolIsSubscript(ri) == TessAPI1.TRUE,
                    TessAPI1.TessResultIteratorSymbolIsSuperscript(ri) == TessAPI1.TRUE));
            } while (TessPageIteratorNext(pi, TessPageIteratorLevel.RIL_SYMBOL) == TessAPI1.TRUE);
        } finally {
            TessBaseAPIDelete(handle); //release memory
        }
    }
}

The legacy mode only works with the 'normal' training data (tessdata). Using the 'best' training data (tessdata_best) produces an error, since those models contain only the LSTM engine.
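When the iterator flags are not available (e.g. when you must use the LSTM engine), a crude fallback is to derive sub/superscript from the character bounding boxes that hocr_char_boxes already provides: within a word, a glyph whose box sits clearly above or below the word's main band is likely a super- or subscript. This is not part of the answer above, just a sketch with hand-made boxes; the one-third-of-height thresholds are assumptions, not Tesseract's internal values:

```java
import java.util.ArrayList;
import java.util.List;

public class SubSupHeuristic {
    // A character bounding box in image coordinates (y grows downward).
    record Box(String ch, int top, int bottom) {}

    /** Classify each glyph as "sub", "sup" or "normal" relative to the word's main band. */
    static List<String> classify(List<Box> word) {
        // Use median top/bottom as the word's main band (robust against the outliers we look for).
        int[] tops = word.stream().mapToInt(Box::top).sorted().toArray();
        int[] bottoms = word.stream().mapToInt(Box::bottom).sorted().toArray();
        int medTop = tops[tops.length / 2], medBottom = bottoms[bottoms.length / 2];
        int height = medBottom - medTop;
        List<String> out = new ArrayList<>();
        for (Box b : word) {
            // Assumed thresholds: shifted by more than a third of the band height.
            if (b.bottom() < medBottom - height / 3) out.add("sup");
            else if (b.top() > medTop + height / 3) out.add("sub");
            else out.add("normal");
        }
        return out;
    }

    public static void main(String[] args) {
        // "a2+b2" with the two digits drawn raised, as in the example image.
        List<Box> word = List.of(
            new Box("a", 20, 40), new Box("2", 10, 25),
            new Box("+", 22, 38), new Box("b", 20, 40), new Box("2", 10, 25));
        System.out.println(classify(word)); // prints [normal, sup, normal, normal, sup]
    }
}
```

The same idea works directly on the bbox values from the hOCR output or from TessPageIteratorBoundingBox, so it can complement the legacy-mode flags rather than replace them.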

0 votes

There is very little information on this topic. One option to improve sub-/superscript character recognition (even if not to recover the position itself) is to preprocess the image, e.g. with OpenCV (cv2) or Pillow (PIL), and then run Tesseract on the result.
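Since the rest of this page uses Java, here is what such preprocessing could look like with only the standard library: upscaling (small super-/subscript glyphs often survive binarization better after enlargement) followed by a fixed-threshold binarization. The 3x factor and the threshold of 128 are assumptions to tune per document; Otsu or adaptive thresholding (as OpenCV offers) would be more robust:

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

public class Preprocess {
    /** Upscale, then binarize to black/white before handing the image to Tesseract. */
    static BufferedImage upscaleAndBinarize(BufferedImage src, int factor, int threshold) {
        int w = src.getWidth() * factor, h = src.getHeight() * factor;
        BufferedImage scaled = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = scaled.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION, RenderingHints.VALUE_INTERPOLATION_BICUBIC);
        g.drawImage(src, 0, 0, w, h, null);
        g.dispose();
        BufferedImage bin = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = scaled.getRGB(x, y);
                // Integer Rec. 601 luma; pixels darker than the threshold become black.
                int luma = (((rgb >> 16) & 0xFF) * 299 + ((rgb >> 8) & 0xFF) * 587 + (rgb & 0xFF) * 114) / 1000;
                bin.setRGB(x, y, luma < threshold ? 0x000000 : 0xFFFFFF);
            }
        }
        return bin;
    }
}
```

The resulting BufferedImage can be passed straight to TessBaseAPISetImage as in the answer above.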

See How to detect subscript numbers in an image using OCR?

Related (but otherwise not answering the question):

https://www.mail-archive.com/[email protected]/msg19434.html

https://github.com/tesseract-ocr/tesseract/blob/master/src/ccmain/superscript.cpp

0 votes

What do you think about getting Tesseract to recognize single letters?

Tesseract does not recognize single characters

I tried it with the option --psm 10:

tesseract imTstg.png out5 --psm 10

but it did not seem to work. I am thinking about just running YOLO to detect the single letters.
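For completeness, the same --psm 10 call can be driven from Java. This sketch only constructs the command line; actually running it requires tesseract on the PATH, and the file names are the placeholders from the command above:

```java
import java.util.List;

public class Psm10Command {
    /** Build the tesseract command line for single-character mode (--psm 10). */
    static List<String> build(String imagePath, String outputBase) {
        return List.of("tesseract", imagePath, outputBase, "--psm", "10");
    }

    public static void main(String[] args) {
        List<String> cmd = build("imTstg.png", "out5");
        System.out.println(String.join(" ", cmd)); // prints: tesseract imTstg.png out5 --psm 10
        // To actually run it (requires an installed tesseract):
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```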