I just built Tika from its GitHub repository and tried to OCR a PDF that contains scanned document pages:
java -cp tika-app/target/tika-app-1.17-SNAPSHOT.jar org.apache.tika.cli.TikaCLI /tmp/testing/sample_scanned.pdf
However, only metadata gets extracted, even though Tika printed this warning beforehand, confirming that Tesseract is installed and will be used:
WARNING: Tesseract OCR is installed and will be automatically applied to image files unless you've excluded the TesseractOCRParser from the default parser. Tesseract may dramatically slow down content extraction (TIKA-2359). As of Tika 1.15 (and prior versions), Tesseract is automatically called. In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.
Note: Regular PDFs (containing plain text) are extracted successfully. The problem seems to be the OCR process itself.
I have tested this on CentOS as well as Ubuntu, with the same result on both.
Do I need to change any config files or specify additional parsers? What could cause this?
Thank you.