I am using the tika-app jar in my project. Is there a way to disable Tesseract OCR in Tika? There are two constraints:

1. Tesseract cannot be uninstalled.

2. tika.xml can't be edited, because tika-app.jar is used off the shelf.

Is there a way to disable OCR from the Java code, by setting a property on the parse context or on the parser?

I tried the code below, but OCR still extracts text from image files during parsing.

    PDFParserConfig pdfConfig = new PDFParserConfig();
    pdfConfig.setOcrStrategy(OCR_STRATEGY.NO_OCR);
    context.set(PDFParserConfig.class, pdfConfig);
The Tika app will happily accept a Tika config XML file passed as a command-line argument. Why not do that? - Gagravarr

Tika app is used as an external library file, and it is configured that way. Is it possible to set this through Java code? - Santhosh

Sure! Just create a TikaConfig object with your own settings, then pass that to the Tika code you're using. - Gagravarr

1 Answer

<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
            <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
        </parser>
    </parsers>
</properties>
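
Since tika.xml can't be edited on disk, one option is to keep the same config inside the Java code and hand it to Tika as a DOM document. The following is only a minimal sketch: the class name `NoOcrTikaConfig` and its helper methods are made up for illustration, and the part shown here uses the JDK alone. The Tika calls in the trailing comment (`new TikaConfig(Document)` and `new AutoDetectParser(TikaConfig)`) exist to my knowledge, but check them against the Tika version bundled in your tika-app jar.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper class; the name is not part of Tika.
public class NoOcrTikaConfig {

    // Same config as the XML above, embedded so no external file is needed.
    static final String CONFIG_XML =
          "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<properties>\n"
        + "    <parsers>\n"
        + "        <parser class=\"org.apache.tika.parser.DefaultParser\">\n"
        + "            <parser-exclude class=\"org.apache.tika.parser.ocr.TesseractOCRParser\"/>\n"
        + "        </parser>\n"
        + "    </parsers>\n"
        + "</properties>\n";

    // Parse the embedded XML into a DOM Document using only the JDK.
    static Document buildConfigDocument() {
        try {
            byte[] bytes = CONFIG_XML.getBytes(StandardCharsets.UTF_8);
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
        } catch (Exception e) {
            throw new IllegalStateException("Embedded Tika config is not valid XML", e);
        }
    }

    // Return the class name listed in the first <parser-exclude> element.
    static String excludedParserClass(Document doc) {
        Element exclude = (Element) doc.getElementsByTagName("parser-exclude").item(0);
        return exclude.getAttribute("class");
    }

    public static void main(String[] args) {
        Document doc = buildConfigDocument();
        System.out.println("Excluded parser: " + excludedParserClass(doc));

        // Assumption: with Tika on the classpath, the document can be passed on like this:
        //   TikaConfig config = new TikaConfig(doc);
        //   Parser parser = new AutoDetectParser(config);
        //   parser.parse(stream, handler, metadata, context);
    }
}
```

The point of building the config in code is that both constraints from the question stay satisfied: Tesseract remains installed and the jar is untouched, while the parser created from this config simply never includes `TesseractOCRParser`.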