I have a scanned PDF that has already been OCRed, so it now has two layers: the scanned image and a text layer on top of it.
If I use Tika with integrated Tesseract to extract text from that PDF, I get duplicate text: one copy comes from the existing OCR text layer and the other from Tesseract OCRing the image again.
In this case I only need the text from the existing OCR layer.
I can't just disable Tesseract because there may be PDFs containing only images or PDFs that contain text and images.
Tesseract is integrated into Tika as described in "Apache Tika extract scanned PDF files".
Is there any way to tell Tika not to run Tesseract on images inside a PDF that already have OCR text over them?
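
To make the desired behavior concrete, here is a rough sketch of the per-page decision I would like to end up with. It is plain Java and does not use any Tika or PDFBox API: `OcrDecision`, `shouldOcr`, and its three parameters are hypothetical names I made up for illustration, and the "full-page image plus existing text means already OCRed" rule is only my assumption about how such pages could be recognized.

```java
// Hypothetical sketch of the desired decision logic; not a Tika or PDFBox API.
final class OcrDecision {

    /**
     * Decides whether a page's images should be sent to Tesseract.
     *
     * @param embeddedTextLength number of characters already extractable from the page
     * @param imageCount         number of images on the page
     * @param fullPageImage      true if a single image covers (roughly) the whole page,
     *                           which I assume indicates a scan
     */
    static boolean shouldOcr(int embeddedTextLength, int imageCount, boolean fullPageImage) {
        if (imageCount == 0) {
            return false; // nothing to OCR
        }
        if (embeddedTextLength == 0) {
            return true; // image-only page: OCR is the only source of text
        }
        // Text and images together: skip OCR only when the text appears to be
        // an OCR layer sitting over a full-page scan (assumed heuristic).
        return !fullPageImage;
    }

    public static void main(String[] args) {
        System.out.println("scan with OCR layer: " + shouldOcr(1200, 1, true));
        System.out.println("image-only scan:     " + shouldOcr(0, 1, true));
        System.out.println("text plus figures:   " + shouldOcr(1200, 2, false));
    }
}
```

The three calls in `main` correspond to the three kinds of PDF mentioned above: an already OCRed scan, a PDF containing only images, and a PDF containing both regular text and images.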