Tell if text of PDF is visible or not

Question

I'm parsing some PDF files using the pdfminer library.

I need to know if the document is a scanned document, where the scanning machine places the scanned image on top and OCR-extracted text in the background.

Is there a way to identify whether text is visible? OCR machines do place it on the page so that it can be selected.

In general, the problem is distinguishing between two very different but similar-looking cases.

In one case there's an image of a scanned document that covers most of the page, with the OCR text behind it.

Here's the PDF as text with the image truncated: http://pastebin.com/a3nc9ZrG

In the other case there's a background image that covers most of the page with the text in front of it.

Telling them apart is proving difficult for me.

David van Driessche · Accepted Answer · 2015-08-04T14:15:31

Your question is a bit confusing, so I'm not really sure what is going to help you the most. However, you describe two ways an OCR engine can "hide" text. I think both are detectable, but one is much easier than the other.

Hidden text
Hidden text is regular or invisible text that is placed behind something else. In other words, you use the stacking order of objects to hide some of them. The only way you can detect this type of case is by figuring out where all of the text objects on the page are (calculating their bounding boxes isn't trivial but certainly possible) and then figuring out whether any of the images on the page overlaps that text and is in front of it. Some additional comments:

Theoretically it could be something other than an image hiding the text, but in your OCR case I would guess it's always an image.
Even though an image overlaps the text, the image may be transparent in some way. In that case, the text underneath may still shine through. For output from a typical OCR engine, that is probably unlikely.
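To make the stacking-order check concrete, here is a minimal sketch in plain Python. It assumes you have already collected the page objects in painting order (objects painted later in the content stream end up on top) together with their bounding boxes as (x0, y0, x1, y1) tuples, which is the shape of the bbox attribute on pdfminer layout objects. The function names, the (kind, bbox, label) tuple layout and the 0.9 threshold are my own choices for illustration, not part of any library, and recovering the painting order itself is not shown.

```python
def overlap_area(a, b):
    """Area of the intersection of two (x0, y0, x1, y1) boxes, 0 if disjoint."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0) * max(height, 0)


def covered_text(objects, threshold=0.9):
    """Return labels of text objects that a later-painted image covers.

    objects: list of (kind, bbox, label) in painting order, where kind is
    "text" or "image". This tuple layout is a hypothetical convention.
    threshold: fraction of the text box that must be covered (assumed value).
    """
    hidden = []
    for index, (kind, bbox, label) in enumerate(objects):
        if kind != "text":
            continue
        area = (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])
        if area <= 0:
            continue
        # Only objects painted after the text can be in front of it.
        for later_kind, later_bbox, _ in objects[index + 1:]:
            if later_kind == "image" and overlap_area(bbox, later_bbox) / area >= threshold:
                hidden.append(label)
                break
    return hidden


# Scanned page: the OCR text is painted first, the scan on top of it.
page = [
    ("text", (42, 760, 100, 770), "Chicken"),
    ("image", (0, 0, 612, 792), "scan"),
    ("text", (50, 50, 120, 60), "Caption"),
]
print(covered_text(page))
```

This deliberately ignores the caveats above: it does not consider transparency, clipping, or occluders that are not images.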

Invisible text
PDF supports invisible text. More precisely, PDF supports different text rendering modes; those rendering modes determine whether characters are filled, outlined, filled + outlined, or invisible (there are other possibilities as well). In the PDF file you posted, you can find this fragment:

BT
3 Tr
0.00 Tc
/F3 8.5 Tf
1 0 0 1 42.48 762.96 Tm
(Chicken ) Tj

That's an invisible chicken right there! The instruction "3 Tr" sets the text rendering mode to "3", which is equal to "invisible" or "neither stroked nor filled" as the PDF specification very elegantly puts it.
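If you can get at the decoded (uncompressed) content stream as text, a rough way to spot this is to track the Tr operator while scanning the stream. The sketch below is a simplified tokenizer of my own, not pdfminer's parser: it only handles literal strings shown with Tj, does not handle nested parentheses, hex strings or TJ arrays, and the function name is made up for illustration.

```python
import re

# A token is either a literal string "(...)" or a run of other characters.
TOKEN = re.compile(r"\((?:\\.|[^\\()])*\)|[^\s()]+")
NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)")


def invisible_strings(content):
    """Return the literal strings shown with Tj while the rendering mode is 3."""
    mode = 0        # the default text rendering mode is 0 (fill)
    saved = []      # modes saved by q, restored by Q
    operands = []
    hidden = []
    for tok in TOKEN.findall(content):
        if tok.startswith("(") or tok.startswith("/") or NUMBER.fullmatch(tok):
            operands.append(tok)
            continue
        # Anything else is treated as an operator that consumes the operands.
        if tok == "Tr" and operands and operands[-1].isdigit():
            mode = int(operands[-1])
        elif tok == "Tj" and operands and operands[-1].startswith("("):
            if mode == 3:
                hidden.append(operands[-1][1:-1])
        elif tok == "q":
            saved.append(mode)
        elif tok == "Q" and saved:
            mode = saved.pop()
        operands = []
    return hidden


fragment = """BT
3 Tr
0.00 Tc
/F3 8.5 Tf
1 0 0 1 42.48 762.96 Tm
(Chicken ) Tj
"""
print(invisible_strings(fragment))  # ['Chicken ']
```

The q and Q handling is there because the rendering mode is part of the graphics state, so a mode set inside a q ... Q pair does not survive the Q.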

It's worth mentioning that these two techniques can be used interchangeably by OCR engines. Placing invisible text on top of a scanned image is actually good practice because it means that most PDF viewers will allow you to select the text. Some PDF viewers I have looked at in the past didn't allow text selection if the text was "behind" the image.
