Is there a way to disable OCR mode in Tika without uninstalling Tesseract?

Question

I am using the tika-app jar in my project. Is there a way to disable Tesseract OCR in Tika? Two constraints have to be respected:

1. Tesseract cannot be uninstalled

2. tika.xml can't be edited, as tika-app.jar is used off the shelf

Is there a way to disable OCR from the Java code, by setting a property on the parse context or the parser?

I tried the code below, but OCR still extracts text from image files during parsing.

    // Only affects OCR applied to PDF documents
    PDFParserConfig pdfConfig = new PDFParserConfig();
    pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.NO_OCR);
    context.set(PDFParserConfig.class, pdfConfig);

The Tika App will happily accept a Tika Config xml file passed as a command line argument, why not do that? — Gagravarr
Tika app is used as an external library file and it is configured that way. Is it possible to set it through java code? — Santhosh
Sure! Just create a TikaConfig object with your own settings, then pass that to the Tika code you're using — Gagravarr

suraj huljute · Accepted Answer · 2019-09-25T07:36:11

<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
            <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
        </parser>
    </parsers>
</properties>
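
One way to apply the configuration above from Java without touching tika-app.jar is sketched here. It relies on an assumption worth checking against your Tika version: that Tika's default configuration consults the `tika.config` system property (or the `TIKA_CONFIG` environment variable) when no explicit config is given. The class name `DisableTikaOcr` and the method `installNoOcrConfig` are made up for illustration. The sketch writes the XML to a temporary file and points the property at it, using only the JDK.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class DisableTikaOcr {

    // Same configuration as in the answer: the default parsers minus Tesseract.
    static final String CONFIG_XML = String.join("\n",
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>",
            "<properties>",
            "    <parsers>",
            "        <parser class=\"org.apache.tika.parser.DefaultParser\">",
            "            <parser-exclude class=\"org.apache.tika.parser.ocr.TesseractOCRParser\"/>",
            "        </parser>",
            "    </parsers>",
            "</properties>",
            "");

    // Hypothetical helper: writes the config to a temp file and registers it
    // through the "tika.config" system property (assumed to be read by Tika
    // when it builds its default configuration).
    static Path installNoOcrConfig() throws IOException {
        Path configFile = Files.createTempFile("tika-no-ocr", ".xml");
        configFile.toFile().deleteOnExit();
        Files.write(configFile, CONFIG_XML.getBytes(StandardCharsets.UTF_8));
        System.setProperty("tika.config", configFile.toAbsolutePath().toString());
        return configFile;
    }

    public static void main(String[] args) throws IOException {
        // Call this before any Tika parser or facade object is created.
        Path configFile = installNoOcrConfig();
        System.out.println("tika.config = " + configFile);
    }
}
```

If you would rather not depend on a system property, the same file can be loaded explicitly, along the lines of `new AutoDetectParser(new TikaConfig(configFile))`, which matches the suggestion in the comments to create a TikaConfig object with your own settings. The `PDFParserConfig` attempt in the question most likely had no effect on image files because it only governs OCR inside PDFs.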
