How to differentiate between a scanned PDF and a regular text PDF

Question

I'm working on an OCR task for multiple PDF files. Some of them are scanned (non-searchable) and others are native (searchable) PDFs.

I have two separate code executions in order to gather the text data.

The one for the scanned PDF:

library(magick)
# Render each page to an image, then run OCR in Spanish
image1 = image_read_pdf(file.list1[1], density = 150)
image1 = image_ocr(image1, language = "spa")

The one for the text PDF:

library(pdftools)
text1 = pdf_text(file.list1[2])

Since the OCR function takes a while on each file, I would like to be able to tell the two kinds of PDF apart before I convert them to text. Is there a way I could identify them?

I have tried pdf_fonts(file.list1[1]), but I'm not able to get a conclusive result that distinguishes a scanned PDF from a native text PDF.

Andres Mora · Accepted Answer · 2021-04-09T18:33:52

I've been thinking of using the size of the table that pdf_fonts() returns: it has rows for the native PDF and is empty for the scanned one. I don't know whether those font errors could cause me trouble later.

Native PDF

> nrow(pdf_fonts(file.list1[2])) * ncol(pdf_fonts(file.list1[2]))
PDF error: No display font for 'ArialUnicode'
PDF error: Couldn't find a font for 'Helvetica', subst is 'Helvetica'
PDF error: Couldn't find a font for 'Helvetica-Bold', subst is 'Helvetica'
PDF error: No display font for 'ArialUnicode'
PDF error: Couldn't find a font for 'Helvetica', subst is 'Helvetica'
PDF error: Couldn't find a font for 'Helvetica-Bold', subst is 'Helvetica'
[1] 8

Scanned PDF

> nrow(pdf_fonts(file.list1[1])) * ncol(pdf_fonts(file.list1[1]))
[1] 0
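
The check above can be wrapped into a small routing function. This is only a minimal sketch, not a definitive test: the names looks_scanned, is_scanned_pdf and extract_text are hypothetical, and the min_chars threshold of 20 is an arbitrary assumption. It also adds a second signal (how much text pdf_text() actually returns), because a scanned PDF that already carries an OCR text layer will usually list fonts even though its pages are images.

```r
# Pure helper: given the per-page strings returned by pdftools::pdf_text(),
# decide whether the document looks scanned (little or no extractable text).
# min_chars is an assumed threshold; tune it for your own files.
looks_scanned <- function(page_text, min_chars = 20) {
  # Count non-whitespace characters on each page
  chars_per_page <- nchar(gsub("\\s", "", page_text))
  length(page_text) == 0 || mean(chars_per_page) < min_chars
}

# Hypothetical wrapper: treat a PDF as scanned if it lists no fonts
# (the check from the answer) or if it yields almost no text.
is_scanned_pdf <- function(path, min_chars = 20) {
  no_fonts <- nrow(pdftools::pdf_fonts(path)) == 0
  no_fonts || looks_scanned(pdftools::pdf_text(path), min_chars)
}

# Route each file to OCR or to direct text extraction.
extract_text <- function(path, language = "spa") {
  if (is_scanned_pdf(path)) {
    pages <- magick::image_read_pdf(path, density = 150)
    magick::image_ocr(pages, language = language)
  } else {
    pdftools::pdf_text(path)
  }
}
```

With that in place, something like lapply(file.list1, extract_text) would run the slow OCR step only on the files that appear to need it.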
