I've noticed that when I use OCR (in this case Adobe Acrobat Pro) to turn a scanned PDF document into text, I get very different outputs depending on how I extract the data.

[Image: excerpt of the OCR'd PDF]

In the above photo you can see a piece of a PDF that has been OCR'd into fairly good quality text. If I select it in Adobe and copy it to, say, a Word or .txt document, it pastes over perfectly fine.

[Image: the same excerpt after extraction, with the text jumbled]

However, if I export it from Adobe to Rich Text Format, or extract it with Python's PDFMiner or the Python Apache Tika bindings, I get the result in the above photo, which, as you can see, is completely jumbled. The extraction results are very consistent between the approaches: all three jumble it in exactly the same way.

Would any of you have any idea why an OCR'd PDF can be copied just fine into a text editor but comes out in such a bizarre way when extracted?

Thank you!

Regards, Mano

One is text extraction and one is image extraction. - Tilman Hausherr
Right, but why would text extraction consistently mess up what otherwise seems like a perfectly fine image extraction? I could see it if the image extraction were messing up the PDF badly, but in this case it can easily be copy-pasted to another text document and comes out just fine. Perhaps I simply don't understand PDF text extraction? - manofone
Please share your PDF. - Tilman Hausherr

1 Answer

So what ended up working for me was running the initial parse with Apache Tika and then passing the few files it failed on through PyPDF2. My theory is that PyPDF2 parses the file in a different way that, unlike Tika, doesn't rely on the root of the PDF, and that the root is what seems to have become corrupted in a few of these OCR'd docs.
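A minimal sketch of that Tika-first, PyPDF2-second flow might look like the following. It assumes the `tika` package (which needs a Java runtime) and a recent PyPDF2 that provides `PdfReader`. The helper names (`extract_with_tika`, `extract_with_pypdf2`, `looks_usable`, `extract_text`) are made up for illustration, and the "did it work" check is only a placeholder, since the right test for a jumbled result depends on your documents.

```python
from typing import Callable, Optional, Sequence


def extract_with_tika(path: str) -> Optional[str]:
    """Extract text via the Apache Tika Python bindings."""
    from tika import parser  # imported lazily; needs the tika package and Java

    parsed = parser.from_file(path)
    return parsed.get("content")


def extract_with_pypdf2(path: str) -> Optional[str]:
    """Extract text page by page via PyPDF2."""
    from PyPDF2 import PdfReader  # imported lazily; needs PyPDF2 with PdfReader

    reader = PdfReader(path)
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def looks_usable(text: Optional[str], min_chars: int = 20) -> bool:
    """Hypothetical placeholder check: non-empty text of some minimum length.

    This does NOT detect jumbled output; swap in whatever test fits your files.
    """
    return text is not None and len(text.strip()) >= min_chars


def extract_text(
    path: str,
    extractors: Sequence[Callable[[str], Optional[str]]] = (
        extract_with_tika,
        extract_with_pypdf2,
    ),
    is_ok: Callable[[Optional[str]], bool] = looks_usable,
) -> Optional[str]:
    """Try each extractor in order and return the first acceptable result."""
    for extractor in extractors:
        try:
            text = extractor(path)
        except Exception:
            # A parser that chokes on the file just means "try the next one".
            continue
        if is_ok(text):
            return text
    return None
```

Calling `extract_text("scan.pdf")` (the file name is just an example) tries Tika first and only falls back to PyPDF2 when Tika raises or returns text that fails the check. Passing the extractors in as a list keeps the order easy to change if another library turns out to handle your files better.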

Not sure of the initial cause but that was my solution.