Suppose my user scans a document on the scanner in their office, and the scanner produces a PDF of the scanned document. That is essentially the type of file I have.
What I want to do is extract the text from this PDF. This is not a "first generation" PDF, in the sense that the text is not embedded in the PDF as text. The text exists only inside the image that is in the PDF.
Is there functionality in iText or PDFBox that allows this data to be retrieved? I am trying to avoid doing OCR on the image if possible. I was hoping there was something built into iText or PDFBox that does this.
Note that I am not talking about extracting "normal" text from a PDF, as outlined here: How to get raw text from pdf file using java