I'm working on an OCR task for multiple PDF files. Some of them are scanned (non-searchable) and others are native (searchable) PDFs.
I have two separate pieces of code to extract the text, one for each kind of file.
The one for the scanned PDF:
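library(magick)  # image_read_pdf() and image_ocr(); OCR also needs the tesseract package
image1 <- image_read_pdf(file.list1[1], density = 150)  # render the pages at 150 dpi
image1 <- image_ocr(image1, language = "spa")            # OCR with the Spanish language data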
The one for the text PDF:
library(pdftools)  # pdf_text()
text1 <- pdf_text(file.list1[2])  # one character string per page
Since the OCR function takes a while on each file, I would like to be able to tell the two kinds of PDF apart before I convert them to text. Is there a way I could identify them?
I have tried pdf_fonts(file.list1[1]), but I'm not able to get a conclusive result that differentiates a scanned PDF from a native text PDF.
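To make the goal concrete, here is a rough sketch of the kind of check I have in mind. It rests on an assumption I have not verified across my files: that pdf_text() returns (almost) no characters for a scanned PDF. The names looks_scanned, extract_text and min_chars are hypothetical, and the threshold of 20 characters per page is a guess.

```r
# Hypothetical helper: decide from already-extracted page text whether a PDF
# looks scanned. `min_chars` is an assumed threshold, not a tested value.
looks_scanned <- function(page_text, min_chars = 20) {
  # Count the non-whitespace characters on each page
  chars_per_page <- nchar(gsub("\\s", "", page_text))
  # No pages at all, or very little text per page on average: assume a scan
  length(page_text) == 0 || mean(chars_per_page) < min_chars
}

# Hypothetical wrapper: try the fast pdf_text() first, fall back to OCR
# only when the extracted text looks empty.
extract_text <- function(path, language = "spa") {
  page_text <- pdftools::pdf_text(path)
  if (!looks_scanned(page_text)) {
    return(page_text)
  }
  image <- magick::image_read_pdf(path, density = 150)
  magick::image_ocr(image, language = language)
}
```

With something like this I could run lapply(file.list1, extract_text) over the whole list. One limitation I can already see: a scanned PDF that already carries an OCR text layer would be treated as native.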