Tika duplicates text when used with Tesseract on OCR PDF

Question

I have a scanned PDF that has already been OCRed, so it now has two layers: the scanned image and a text layer on top of it.

If I use Tika with integrated Tesseract to extract text from that PDF, I get duplicate text: one copy comes from the existing OCR text layer and the other from Tesseract OCRing the image.

In this case I need only the existing OCR text.

I can't just disable Tesseract because there may be PDFs containing only images or PDFs that contain text and images.

Tesseract is integrated into Tika as described in the question "Apache Tika extract scanned PDF files".

Is there any way to tell Tika not to use Tesseract for images inside a PDF that already have OCR text over them?

Sorry if it looks like an ad, but you can use Ambar to avoid problems with Tika's OCR. We put quite a lot of effort into making it work smoothly. - Ilia P

Trinadh Gupta · Accepted Answer · 2017-03-03T14:43:28

We had a similar problem. We handled it with a simple if/else condition: we first pass the PDF to the default PDF parser without OCR, and only if that returns empty text do we parse the PDF again with the Tesseract option enabled.
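The if/else fallback described above could be sketched roughly as follows. This is only an illustration of the control flow, not our actual code: the class name `FallbackExtractor` and the two extractor parameters are hypothetical placeholders. In a real setup, `textLayerOnly` would wrap a Tika parse with OCR turned off and `withOcr` would wrap a Tika parse with Tesseract enabled.

```java
import java.util.function.Function;

// Hypothetical helper: tries the embedded text layer first, runs OCR only if needed.
final class FallbackExtractor {

    private FallbackExtractor() {
    }

    /**
     * @param pdf           the document handle (path, stream, bytes, ...)
     * @param textLayerOnly placeholder for a Tika parse with OCR disabled
     * @param withOcr       placeholder for a Tika parse with Tesseract enabled
     * @return the embedded text if it is non-blank, otherwise the OCR result
     */
    static <T> String extract(T pdf,
                              Function<T, String> textLayerOnly,
                              Function<T, String> withOcr) {
        // First pass: read only the text that is already stored in the PDF.
        String text = textLayerOnly.apply(pdf);

        // A non-blank result means the PDF already has a text layer,
        // so running Tesseract would only duplicate it.
        if (text != null && !text.trim().isEmpty()) {
            return text;
        }

        // Second pass: no usable text layer, so fall back to OCR.
        return withOcr.apply(pdf);
    }
}
```

Note that the decision is made once per document. A PDF that mixes pages with a text layer and image-only pages would return on the first branch, so the image-only pages would not be OCRed with this approach.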
