
I just built Tika from the GitHub repository and tried to OCR a PDF that contains scanned document pages.

java -cp tika-app/target/tika-app-1.17-SNAPSHOT.jar org.apache.tika.cli.TikaCLI /tmp/testing/sample_scanned.pdf

However, only metadata gets extracted, even though I got confirmation beforehand that Tesseract is installed and in use:

WARNING: Tesseract OCR is installed and will be automatically applied to image files unless you've excluded the TesseractOCRParser from the default parser. Tesseract may dramatically slow down content extraction (TIKA-2359). As of Tika 1.15 (and prior versions), Tesseract is automatically called. In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.


Note: Regular PDFs (containing plain text) get extracted successfully. The problem seems to be the OCR process itself.

This has been tested on CentOS as well as Ubuntu, with the same result.

Do I need to make changes to config files or specify additional parsers? What could cause this?

Thank you.

Still looking for a solution. Do I need to specify the OCR part somewhere in the configuration for it to be used? If so, why is there a warning message stating that 'Tesseract OCR is installed and will be automatically applied' (as posted above)? - Gugols
It seems to be related to the PDF parser. I just ran into the same issue: parsing a .docx file with an embedded image extracts the text from the image, but using the same image within a PDF file does not work. - Ben Romberg
Hi @BenRomberg, were you able to resolve the issue? - S. Das
On newer TikaServer versions this is enabled via a header, but it does not seem to be working. - S. Das

1 Answer


It turns out that extraction of inline images from PDFs is disabled by default, so the scanned page images never reach Tesseract. From the PDFParserConfig Javadoc:

Beware: some PDF documents of modest size (~4MB) can contain thousands of embedded images totaling > 2.5 GB. Also, at least as of PDFBox 1.8.5, there can be surprisingly large memory consumption and/or out of memory errors. Set to true with caution. The default is false.

A simple example to enable it that worked for me:

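import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

Parser parser = new AutoDetectParser();
ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
ParseContext parseContext = new ParseContext();
PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setExtractInlineImages(true); // off by default; exposes embedded page images to the OCR parser
parseContext.set(PDFParserConfig.class, pdfConfig);
try (InputStream stream = ClasspathUtil.readStreamFromClasspath("test.pdf")) { // ClasspathUtil is a project-local helper, not part of Tika; any InputStream works
    parser.parse(stream, handler, new Metadata(), parseContext);
    System.out.println(handler.toString());
}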
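
For the TikaServer case mentioned in the comments, the same switch is, as far as I know, exposed as a request header. Below is a minimal sketch using the JDK's HTTP client (Java 11+). The header name X-Tika-PDFextractInlineImages and the endpoint http://localhost:9998/tika are assumptions based on the TikaServer documentation and default port, and the class and method names are hypothetical. I have not verified this against every server version, and one commenter reports it did not work for them.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class; not part of Tika.
class TikaServerOcrRequest {

    // Builds a PUT request that asks tika-server to extract inline PDF images.
    // The header name is an assumption taken from the TikaServer docs.
    static HttpRequest buildOcrRequest(URI endpoint, byte[] pdfBytes) {
        return HttpRequest.newBuilder(endpoint)
                .header("Accept", "text/plain")
                .header("X-Tika-PDFextractInlineImages", "true")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(pdfBytes))
                .build();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length != 1) {
            System.err.println("usage: java TikaServerOcrRequest.java <file.pdf>");
            return;
        }
        byte[] pdf = Files.readAllBytes(Path.of(args[0]));
        // Assumes a tika-server instance is listening on the default port 9998.
        HttpRequest request = buildOcrRequest(URI.create("http://localhost:9998/tika"), pdf);
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

If the response still contains no text for a scanned PDF, it is worth checking the server log to confirm that Tesseract is found on the server's PATH.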