I've noticed that when I use OCR (in this case Adobe Acrobat Pro) to convert a scanned PDF document into text, I get very different outputs depending on how I extract the text.
In the above photo, you can see a piece of a PDF that has been OCR'ed into fairly good quality text. If I select it in Adobe and copy it to, say, a Word or .txt document, it pastes over perfectly fine.
However, if I export it from Adobe to Rich Text Format, or extract it with Python's PDFMiner or Apache Tika from Python, then I get the output shown in the above photo, which, as you can see, is completely jumbled. The extraction results are very consistent between the approaches: all three jumble the text in exactly the same way.
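To help narrow this down, here is a minimal sketch (standard library only) of how the copy-pasted text and an extracted text could be compared, to check whether the extraction contains the same words in a different order or genuinely different characters. The function names and the sample strings are hypothetical placeholders for illustration, not my actual text or part of any of the tools above.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens, ignoring whitespace and punctuation."""
    return re.findall(r"\w+", text.lower())


def compare_outputs(copied, extracted):
    """Classify how the extracted text differs from the copy-pasted text."""
    copied_tokens = tokenize(copied)
    extracted_tokens = tokenize(extracted)
    if copied_tokens == extracted_tokens:
        return "identical"
    # Same words with the same counts, but in a different sequence.
    if Counter(copied_tokens) == Counter(extracted_tokens):
        return "reordered"
    return "different content"


# Hypothetical sample strings standing in for the real outputs.
copied = "Invoice number 1234 issued on 5 May"
extracted = "issued on 5 May Invoice number 1234"
print(compare_outputs(copied, extracted))
```

A "reordered" result would suggest the words themselves are intact and only their sequence changes between methods, while "different content" would suggest the characters differ as well.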
Would any of you have any idea why an OCR'd PDF can be copied just fine to a text editor but comes out in such a bizarre way when it is exported or extracted?
Thank you!
Regards, Mano