
I am using Tika and Tesseract to extract OCR text from PDF files that contain scanned images. I have managed to parse PDF documents containing images with RecursiveParserWrapper instead of Parser, and it is working fine. However, the client wants to configure everything related to Tesseract OCR somewhere else and use the existing code as it is to extract OCR text from all supported formats.

The existing code uses a plain Parser to extract data. Can anybody explain why we would use RecursiveParserWrapper instead of a normal Parser when extracting data from images or PDFs containing scanned images?

1 Answer

There are 3 benefits to the RecursiveParserWrapper.

  1. maintains metadata from embedded documents
  2. records stacktraces from parse exceptions in embedded documents
  3. makes it easier to identify what came from the main document and what came from embedded docs/attachments

If you don’t care about these, then you should be able to extract the same text with either the AutoDetectParser or the RecursiveParserWrapper. If you do see a difference in the text extracted, please open a ticket on Tika’s JIRA.
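To make the three benefits above concrete, here is a minimal sketch of how the RecursiveParserWrapper might be used. It assumes Tika 2.x, where the content, embedded-path, and embedded-exception keys live in `TikaCoreProperties` (in 1.x they were defined elsewhere). The class name `RecursiveExtract` and the helper `parseRecursively` are made up for illustration.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.RecursiveParserWrapper;
import org.apache.tika.sax.BasicContentHandlerFactory;
import org.apache.tika.sax.RecursiveParserWrapperHandler;

// Hypothetical example class; assumes Tika 2.x on the classpath.
public class RecursiveExtract {

    // Returns one Metadata object per document: index 0 is the container,
    // the remaining entries are embedded documents/attachments.
    static List<Metadata> parseRecursively(InputStream stream) throws Exception {
        RecursiveParserWrapper wrapper = new RecursiveParserWrapper(new AutoDetectParser());
        // TEXT handler with no write limit (-1); each document gets its own handler.
        RecursiveParserWrapperHandler handler = new RecursiveParserWrapperHandler(
                new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1));
        wrapper.parse(stream, handler, new Metadata(), new ParseContext());
        return handler.getMetadataList();
    }

    public static void main(String[] args) throws Exception {
        try (InputStream stream = Files.newInputStream(Paths.get(args[0]))) {
            for (Metadata m : parseRecursively(stream)) {
                // Benefit 3: the embedded path is null for the container document.
                System.out.println("path: " + m.get(TikaCoreProperties.EMBEDDED_RESOURCE_PATH));
                // Benefit 1: every embedded document keeps its own metadata.
                System.out.println("type: " + m.get(Metadata.CONTENT_TYPE));
                // Benefit 2: a stack trace is stored if an embedded parse failed.
                String trace = m.get(TikaCoreProperties.EMBEDDED_EXCEPTION);
                if (trace != null) {
                    System.out.println("embedded parse failed: " + trace);
                }
                System.out.println("text: " + m.get(TikaCoreProperties.TIKA_CONTENT));
            }
        }
    }
}
```

Concatenating the text of every entry should give roughly what a plain AutoDetectParser run returns as one string.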

Also note, if you’re using an old version of Tika (< 1.15), you need to supply the Parser for embedded documents in the ParseContext for each parse; if you don’t, those older versions will not parse any embedded docs.
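As a rough sketch of that registration, assuming the existing code is built around a plain Parser and a BodyContentHandler (the class name `EmbeddedAwareExtract` and the helper `extractText` are hypothetical), the parser can be placed in the ParseContext like this. Newer versions should not need the extra line, but it is harmless there.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical example class showing the ParseContext registration.
public class EmbeddedAwareExtract {

    static String extractText(InputStream stream) throws Exception {
        Parser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        // Tell Tika which parser to use for embedded documents
        // (for example, scanned images inside a PDF).
        context.set(Parser.class, parser);

        // -1 disables the default write limit on extracted characters.
        BodyContentHandler handler = new BodyContentHandler(-1);
        parser.parse(stream, handler, new Metadata(), context);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        try (InputStream stream = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(extractText(stream));
        }
    }
}
```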