I am using the tika-app jar in my project. Is there a way to disable Tesseract OCR in Tika? There are two constraints:
1. Tesseract cannot be uninstalled
2. tika.xml can't be edited, as tika-app.jar is used off the shelf
Is there a way to disable OCR from the Java code, by setting a property on the context or the parser?
I tried the code below, but OCR still extracts text from image files while parsing.
```java
PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setOcrStrategy(OCR_STRATEGY.NO_OCR);
context.set(PDFParserConfig.class, pdfConfig);
```
Create a TikaConfig object with your own settings, then pass that to the Tika code you're using – Gagravarr
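A minimal sketch of that suggestion, assuming tika-app.jar is on the classpath and that its `TesseractOCRParser` lives at `org.apache.tika.parser.ocr.TesseractOCRParser`. The class name `NoOcrTika` and the method `buildParser` are hypothetical names used only for illustration. The idea is to supply the configuration as an in-memory XML string, so no tika.xml on disk is touched, and to exclude the Tesseract parser from the default parser set. The `PDFParserConfig` attempt in the question likely has no effect on image files because it only governs how the PDF parser treats PDFs, whereas standalone images are routed to the Tesseract parser directly.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.config.TikaConfig;
import org.apache.tika.parser.AutoDetectParser;

// Hypothetical helper class; the name is not part of Tika.
class NoOcrTika {

    // In-memory equivalent of a Tika config file: keep the default
    // parsers but exclude the Tesseract OCR parser.
    static final String NO_OCR_CONFIG =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        + "<properties><parsers>"
        + "<parser class=\"org.apache.tika.parser.DefaultParser\">"
        + "<parser-exclude class=\"org.apache.tika.parser.ocr.TesseractOCRParser\"/>"
        + "</parser>"
        + "</parsers></properties>";

    // Builds an AutoDetectParser backed by the OCR-free configuration.
    static AutoDetectParser buildParser() throws Exception {
        try (InputStream in = new ByteArrayInputStream(
                NO_OCR_CONFIG.getBytes(StandardCharsets.UTF_8))) {
            TikaConfig config = new TikaConfig(in);
            return new AutoDetectParser(config);
        }
    }
}
```

The returned parser can be used in place of a default `AutoDetectParser`, with the usual `parse(stream, handler, metadata, context)` call. Since Tesseract is excluded from the parser set rather than uninstalled, the system installation stays untouched.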