1
votes

I'm working on an OCR task for multiple PDF files. Some of them are scanned (non-searchable) and others are native (searchable) PDFs.

I have two separate code executions in order to gather the text data.

The one for the scanned PDF:

library(magick)  # image_read_pdf() relies on pdftools, image_ocr() on tesseract
image1 <- image_read_pdf(file.list1[1], density = 150)
image1 <- image_ocr(image1, language = "spa")

The one for the text PDF:

library(pdftools)
text1 <- pdf_text(file.list1[2])

Since the OCR function takes a while on each file, I would like to be able to tell the two kinds of PDF apart before I convert them to text. Is there a way I could identify them?

I have tried pdf_fonts(file.list1[1]), but I'm not able to get a conclusive result that differentiates a scanned PDF from a native text PDF.


2 Answers

0
votes

I've been thinking of using the size of the table returned by pdf_fonts(). Could those font errors cause me trouble later?

Native PDF

> nrow(pdf_fonts(file.list1[2])) * ncol(pdf_fonts(file.list1[2]))
PDF error: No display font for 'ArialUnicode'
PDF error: Couldn't find a font for 'Helvetica', subst is 'Helvetica'
PDF error: Couldn't find a font for 'Helvetica-Bold', subst is 'Helvetica'
PDF error: No display font for 'ArialUnicode'
PDF error: Couldn't find a font for 'Helvetica', subst is 'Helvetica'
PDF error: Couldn't find a font for 'Helvetica-Bold', subst is 'Helvetica'
[1] 8

Scanned PDF

> nrow(pdf_fonts(file.list1[1])) * ncol(pdf_fonts(file.list1[1]))
[1] 0
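
The check above could be wrapped in a small helper and applied to the whole file list. This is only a sketch: label_from_fonts() and classify_pdf() are hypothetical names, and it assumes that a PDF declaring no fonts is a scan. A scanned PDF that already carries a hidden OCR text layer would declare fonts and be labelled "native". Testing nrow() alone is enough, since a table with zero rows makes the row-by-column product zero anyway.

```r
# Hypothetical helper: label a PDF from the font table returned by
# pdftools::pdf_fonts(). Assumption: zero declared fonts means a scan.
label_from_fonts <- function(fonts) {
  if (nrow(fonts) == 0) "scanned" else "native"
}

# Hypothetical wrapper: read the font table of one file and label it.
classify_pdf <- function(path) {
  label_from_fonts(pdftools::pdf_fonts(path))
}

# Usage, assuming file.list1 is the character vector of paths from the question:
# kinds   <- vapply(file.list1, classify_pdf, character(1))
# scanned <- file.list1[kinds == "scanned"]  # send these to image_ocr()
# native  <- file.list1[kinds == "native"]   # send these to pdf_text()
```
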
0
votes

Just a brief answer, as the PDF standard is a world of its own!

I've used the following in a batch file on Windows:

@echo off

for /f "delims=" %%a in ('findstr /i /m /c:"/Type /Font" *.pdf') do (
    echo %%a
    goto :continue
)
:continue

The above could be run on all PDFs before processing to divide them into different folders, i.e. replace echo %%a with an if statement...

I was dealing with thousands of PDFs and was also looking for a way to distinguish image PDFs from text PDFs. I found that "/Type /Font" was embedded in the text PDFs but not in the image PDFs. Note: I didn't have any PDFs that mixed text and images, but if I had, I'd expect "/Type /Font" to be present in them.
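
For anyone who wants the same check from R rather than a batch file, here is a rough sketch of the idea. The function name has_font_marker() is hypothetical, and unlike findstr /i the match is case-sensitive. It looks for the literal text with and without the space, since some PDF writers omit it. One caveat: PDFs that store their objects in compressed object streams (possible from PDF 1.5 on) may not show the marker in the raw bytes at all, so a FALSE here is not proof of a scan.

```r
# Hypothetical helper: TRUE when the raw bytes of a file contain a
# font dictionary marker ("/Type /Font" or "/Type/Font").
has_font_marker <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  bytes <- bytes[bytes != as.raw(0)]  # rawToChar() rejects embedded NUL bytes
  txt <- rawToChar(bytes)
  grepl("/Type /Font", txt, fixed = TRUE, useBytes = TRUE) ||
    grepl("/Type/Font", txt, fixed = TRUE, useBytes = TRUE)
}

# Usage, assuming the PDFs sit in the working directory:
# pdfs     <- list.files(pattern = "\\.pdf$", ignore.case = TRUE)
# has_text <- vapply(pdfs, has_font_marker, logical(1))
```
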

One would have to read the PDF standard in detail to find out whether there is an absolute way to distinguish image PDFs from text PDFs.