
I have problems with the general recognition of subscript and superscript in text fragments.

Example image: [image showing text fragments with subscript and superscript]

I used Tesseract 4.1.1 with the training data available at https://github.com/tesseract-ocr/tessdata_best. All the numerous options had their default values except:

  • tessedit_create_hocr = 1 (to get the result as hOCR)
  • hocr_font_info = 1 (to get additional font info such as the font size)
  • hocr_char_boxes = 1 (to get a character-based result)

The language was set to eng. The subscripts/superscripts were not recognized correctly with page segmentation mode 3 (PSM_AUTO_OSD), 11 (PSM_SPARSE_TEXT), or 12 (PSM_SPARSE_TEXT_OSD).

In the output, the sub/sup fragments were all more or less wrong:

  • "SubtextSub" is recognized as "Subtextsu,"
  • "SuptextSub" is recognized as "Suptexts?"
  • "P0" is recognized as "Po"
  • "P100" is recognized as "P1go"
  • "a2+b2" is recognized as "a+b?"

Using Tesseract for OCR, is there a way to ...

  1. optimize subscript/superscript handling?
  2. get information about recognized subscripts/superscripts (in the hOCR output, ideally for each character)?
To give a bit of context: superscripts and subscripts are important when it comes to chemical formulas, and superscripts are also used for footnotes. The distinction from normal text is relevant when the superscript follows a number: "Revenue in Q1 (in million USD): 54²" is very different from "Revenue in Q1 (in million USD): 542". – Martin Thoma

3 Answers

1 vote

Working on the quality of the image, as suggested in other questions/answers on this topic, didn't really change anything.

Following these two links from the Tesseract Google newsgroup, at first it really seemed to be a question of training: link1 and link2.

But after doing some experiments I found that the OEM_DEFAULT OCR engine mode just doesn't surface the needed information. I found a partial solution to the problem: partial, because I now get most of the sub/sup information and the recognized characters are right in most cases, but not for all characters.

Using the OEM_TESSERACT_ONLY OCR engine mode (i.e. the legacy engine) and some API methods provided by Tess4J, I came up with the following Java test class:

import java.awt.image.BufferedImage;
import java.io.File;
import java.nio.IntBuffer;

import com.sun.jna.Pointer;

import net.sourceforge.tess4j.ITessAPI.TessBaseAPI;
import net.sourceforge.tess4j.ITessAPI.TessOcrEngineMode;
import net.sourceforge.tess4j.ITessAPI.TessPageIterator;
import net.sourceforge.tess4j.ITessAPI.TessPageIteratorLevel;
import net.sourceforge.tess4j.ITessAPI.TessPageSegMode;
import net.sourceforge.tess4j.ITessAPI.TessResultIterator;
import net.sourceforge.tess4j.TessAPI1;
import net.sourceforge.tess4j.util.ImageIOHelper;

import static net.sourceforge.tess4j.TessAPI1.*;

public class SubSupEvaluator {
    public void determineSubSupCharacters(BufferedImage image) {
        //1. initialize Tesseract and set image infos
        TessBaseAPI handle = TessAPI1.TessBaseAPICreate();
        try {
            int bpp = image.getColorModel().getPixelSize();
            int bytespp = bpp / 8;
            int bytespl = (int) Math.ceil(image.getWidth() * bpp / 8.0);
            TessBaseAPIInit2(handle, new File("./tessdata/").getAbsolutePath(), "eng", TessOcrEngineMode.OEM_TESSERACT_ONLY);
            TessBaseAPISetPageSegMode(handle, TessPageSegMode.PSM_AUTO_OSD);
            TessBaseAPISetImage(handle, ImageIOHelper.convertImageData(image), image.getWidth(), image.getHeight(), bytespp, bytespl);

            //2. start actual OCR run
            TessBaseAPIRecognize(handle, null);

            //3. iterate over the result character-wise
            TessResultIterator ri = TessBaseAPIGetIterator(handle);
            TessPageIterator pi = TessResultIteratorGetPageIterator(ri);
            TessPageIteratorBegin(pi);
            do {
                //determine character
                Pointer ptr = TessResultIteratorGetUTF8Text(ri, TessPageIteratorLevel.RIL_SYMBOL);
                String character = ptr.getString(0);
                TessDeleteText(ptr); //release memory

                //determine position information
                IntBuffer leftB = IntBuffer.allocate(1);
                IntBuffer topB = IntBuffer.allocate(1);
                IntBuffer rightB = IntBuffer.allocate(1);
                IntBuffer bottomB = IntBuffer.allocate(1);
                TessPageIteratorBoundingBox(pi, TessPageIteratorLevel.RIL_SYMBOL, leftB, topB, rightB, bottomB);

                //write info to console
                System.out.println(String.format("%s - position [%d %d %d %d], subscript: %b, superscript: %b", character, leftB.get(), topB.get(),
                    rightB.get(), bottomB.get(), TessAPI1.TessResultIteratorSymbolIsSubscript(ri) == TessAPI1.TRUE,
                    TessAPI1.TessResultIteratorSymbolIsSuperscript(ri) == TessAPI1.TRUE));
            } while (TessPageIteratorNext(pi, TessPageIteratorLevel.RIL_SYMBOL) == TessAPI1.TRUE);
        } finally {
            TessBaseAPIDelete(handle); //release memory
        }
    }
}

The legacy mode only works with the 'normal' training data (tessdata). Using the 'best' training data (tessdata_best) produces an error, since those models contain only the LSTM engine.
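When the iterator flags are not available (e.g. when you must use the LSTM engine), a crude fallback is to derive sub/superscript from the character bounding boxes that hocr_char_boxes already provides: within a word, a glyph whose box sits clearly above or below the word's main band is likely a super- or subscript. This is not part of the answer above, just a sketch with hand-made boxes; the one-third-of-height thresholds are assumptions, not Tesseract's internal values:

```java
import java.util.ArrayList;
import java.util.List;

public class SubSupHeuristic {
    // A character bounding box in image coordinates (y grows downward).
    record Box(String ch, int top, int bottom) {}

    /** Classify each glyph as "sub", "sup" or "normal" relative to the word's main band. */
    static List<String> classify(List<Box> word) {
        // Use median top/bottom as the word's main band (robust against the outliers we look for).
        int[] tops = word.stream().mapToInt(Box::top).sorted().toArray();
        int[] bottoms = word.stream().mapToInt(Box::bottom).sorted().toArray();
        int medTop = tops[tops.length / 2], medBottom = bottoms[bottoms.length / 2];
        int height = medBottom - medTop;
        List<String> out = new ArrayList<>();
        for (Box b : word) {
            // Assumed thresholds: shifted by more than a third of the band height.
            if (b.bottom() < medBottom - height / 3) out.add("sup");
            else if (b.top() > medTop + height / 3) out.add("sub");
            else out.add("normal");
        }
        return out;
    }

    public static void main(String[] args) {
        // "a2+b2" with the two digits drawn raised, as in the example image.
        List<Box> word = List.of(
            new Box("a", 20, 40), new Box("2", 10, 25),
            new Box("+", 22, 38), new Box("b", 20, 40), new Box("2", 10, 25));
        System.out.println(classify(word)); // prints [normal, sup, normal, normal, sup]
    }
}
```

The same idea works directly on the bbox values from the hOCR output or from TessPageIteratorBoundingBox, so it can complement the legacy-mode flags rather than replace them.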

0 votes

There is very little information on this topic. One option to improve sub-/superscript character recognition (even if not to recover the position itself) is to preprocess the image, e.g. with OpenCV (cv2) or Pillow (PIL), and then run Tesseract on the result.
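Since the rest of this page uses Java, here is what such preprocessing could look like with only the standard library: upscaling (small super-/subscript glyphs often survive binarization better after enlargement) followed by a fixed-threshold binarization. The 3x factor and the threshold of 128 are assumptions to tune per document; Otsu or adaptive thresholding (as OpenCV offers) would be more robust:

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

public class Preprocess {
    /** Upscale, then binarize to black/white before handing the image to Tesseract. */
    static BufferedImage upscaleAndBinarize(BufferedImage src, int factor, int threshold) {
        int w = src.getWidth() * factor, h = src.getHeight() * factor;
        BufferedImage scaled = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = scaled.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION, RenderingHints.VALUE_INTERPOLATION_BICUBIC);
        g.drawImage(src, 0, 0, w, h, null);
        g.dispose();
        BufferedImage bin = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = scaled.getRGB(x, y);
                // Integer Rec. 601 luma; pixels darker than the threshold become black.
                int luma = (((rgb >> 16) & 0xFF) * 299 + ((rgb >> 8) & 0xFF) * 587 + (rgb & 0xFF) * 114) / 1000;
                bin.setRGB(x, y, luma < threshold ? 0x000000 : 0xFFFFFF);
            }
        }
        return bin;
    }
}
```

The resulting BufferedImage can be passed straight to TessBaseAPISetImage as in the answer above.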

See How to detect subscript numbers in an image using OCR?

Related (but otherwise not answering the question):

https://www.mail-archive.com/[email protected]/msg19434.html

https://github.com/tesseract-ocr/tesseract/blob/master/src/ccmain/superscript.cpp

0 votes

What do you think about getting Tesseract to recognize single letters?

Tesseract does not recognize single characters

I tried it with the option --psm 10:

tesseract imTstg.png out5 --psm 10

but it did not seem to work. I am thinking about just running YOLO to detect the single letters.
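For completeness, the same --psm 10 call can be driven from Java. This sketch only constructs the command line; actually running it requires tesseract on the PATH, and the file names are the placeholders from the command above:

```java
import java.util.List;

public class Psm10Command {
    /** Build the tesseract command line for single-character mode (--psm 10). */
    static List<String> build(String imagePath, String outputBase) {
        return List.of("tesseract", imagePath, outputBase, "--psm", "10");
    }

    public static void main(String[] args) {
        List<String> cmd = build("imTstg.png", "out5");
        System.out.println(String.join(" ", cmd)); // prints: tesseract imTstg.png out5 --psm 10
        // To actually run it (requires an installed tesseract):
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```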