2
votes

Assume my user went to a scanner in their office. The scanner is capable of generating a PDF of the scanned document. This is essentially the type of file that I have.

What I want to do is extract the text from this PDF. This is not a "first generation" PDF: the text is not stored in the PDF as text. It exists only inside the scanned image that the PDF contains.

Is there functionality in iText or PDFBox that allows this data to be retrieved? I am trying to avoid doing OCR on the image if possible. I was hoping there was something built into iText or PDFBox that does this.

Note that I am not talking about extracting "normal" text from a PDF, as outlined here: How to get raw text from pdf file using java

1
Your question might be clearer if you removed the mention of PDF entirely. Essentially you're wanting to read text from an image, if I'm reading this correctly. - cadams
You want to do OCR without doing OCR. PDFBox and iText can only extract text that is stored as vector data. You want to get text that consists of pixels in a raster image. That's OCR. Neither PDFBox nor iText supports OCR. - Bruno Lowagie
@cadams Yes, but on a PDF. I do not want to convert it to an image. It has to be done on the PDF itself. - user489041
@BrunoLowagie I suppose what I meant was that I do not want to use a third-party library that does OCR. I was hoping that PDFBox or iText can do this. I'm actually fairly sure that they can. I just need to figure out how to plug that functionality into it. - user489041
Right. As far as I'm aware, what you're wanting to do is not possible. However, you can use a Java wrapper for Tesseract like tesjeract or Tess4J, but you will have to convert the PDF to a PNG or TIFF image first, which you seem to be trying to avoid. - cadams

1 Answer

5
votes

OK, after some looking around, there doesn't seem to be a way to do this with iText or PDFBox alone, but it looks like PDFBox does have a plugin for third-party software that can accomplish what you need. If that is of interest, links are here and here, sourced from here (from @TilmanHausherr).
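
For what it's worth, the "convert to an image, then OCR" route suggested in the comments can be sketched without any Java OCR library by shelling out to command-line tools. This is only a rough sketch under assumptions that are mine, not the question's: it assumes the Poppler `pdftoppm` tool and the `tesseract` executable are installed and on the PATH, it uses 300 DPI and the `eng` language pack as arbitrary defaults, and the class name `ScannedPdfOcr` and its method names are made up for illustration. It uses the Tesseract command line rather than Tess4J or tesjeract.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: rasterize a scanned PDF, then OCR each page image.
class ScannedPdfOcr {

    // Command that renders every page to <prefix>-<n>.png at the given DPI.
    static List<String> renderCommand(Path pdf, Path prefix, int dpi) {
        return List.of("pdftoppm", "-r", String.valueOf(dpi), "-png",
                pdf.toString(), prefix.toString());
    }

    // Command that OCRs one image and writes the recognized text to stdout.
    static List<String> ocrCommand(Path image, String language) {
        return List.of("tesseract", image.toString(), "stdout", "-l", language);
    }

    // Runs a command, returns its stdout, and fails on a non-zero exit code.
    static String run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectError(ProcessBuilder.Redirect.DISCARD) // drop tool diagnostics
                .start();
        String output = new String(process.getInputStream().readAllBytes(),
                StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException(command.get(0) + " exited with code " + exitCode);
        }
        return output;
    }

    // Renders the PDF into a temp directory and concatenates the OCR text of all pages.
    static String extractText(Path pdf) throws IOException, InterruptedException {
        Path workDir = Files.createTempDirectory("scanned-pdf-ocr");
        run(renderCommand(pdf, workDir.resolve("page"), 300));

        List<Path> pages;
        try (Stream<Path> files = Files.list(workDir)) {
            // pdftoppm zero-pads page numbers, so sorting by name should keep page order.
            pages = files.filter(f -> f.toString().endsWith(".png"))
                    .sorted()
                    .collect(Collectors.toList());
        }

        StringBuilder text = new StringBuilder();
        for (Path page : pages) {
            text.append(run(ocrCommand(page, "eng")));
        }
        return text.toString();
    }
}
```

Calling `ScannedPdfOcr.extractText(Path.of("scan.pdf"))` would return whatever text Tesseract recognizes, so the quality depends entirely on the scan. Note that this is still OCR on a rendered image, which is exactly what the question hoped to avoid; it just keeps the OCR engine outside the Java dependencies.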