I am using pdftotext to convert PDF files to text files.

I tested the code on a few files and it worked fine, but when I run it on all of my PDF files (about 2000), it fails with the error "poppler error creating document".

Here is the code:

import pdftotext
import os

directory = "/testfiles" # path where PDF files are saved

for filename in os.listdir(directory):
    if filename.endswith(".pdf"):
        pathname = os.path.join(directory, filename)
        
        with open(pathname, 'rb') as f:
            pdf = pdftotext.PDF(f) # ERROR : poppler error creating document
            
        # Swap only the extension; str.replace would also rewrite '.pdf' elsewhere in the path
        txtname = os.path.splitext(pathname)[0] + '.txt'
        with open(txtname, 'w', encoding='utf-8') as text_file: # edit: encoding utf-8 added
            for page in pdf:
                text_file.write(page)

What is the problem?

I googled this error, and the only solution I found was to update poppler to the latest version. I installed poppler yesterday, so I assume it is already up to date.

I also tried pdfplumber, but it returned “No /Root object! - Is this really a PDF?”. Do both errors point to a problem with the PDF files themselves?

I was able to open the file without any error, so I guess the files are not corrupted.
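
Since two different libraries both complain about the input, one way to narrow this down might be to check which of the 2000 files actually look like PDFs before handing them to pdftotext. The following is only a sketch using the standard library: find_suspect_pdfs is a hypothetical helper name, and the assumption is that a usable PDF is non-empty and has the %PDF- marker somewhere in its first 1024 bytes. It does not prove a file is valid; it only flags obvious non-PDFs (for example zero-byte downloads or hidden metadata files that happen to end in .pdf).

```python
import os

def find_suspect_pdfs(directory):
    """Return (filename, reason) pairs for .pdf files that do not look like PDFs.

    Assumption: a usable PDF is non-empty and carries the '%PDF-' marker
    within its first 1024 bytes. This is a heuristic, not a full validation.
    """
    suspects = []
    for filename in sorted(os.listdir(directory)):
        if not filename.lower().endswith(".pdf"):
            continue
        pathname = os.path.join(directory, filename)
        if not os.path.isfile(pathname):
            suspects.append((filename, "not a regular file"))
            continue
        # Read only the start of the file; the header marker lives there
        with open(pathname, "rb") as f:
            head = f.read(1024)
        if not head:
            suspects.append((filename, "empty file"))
        elif b"%PDF-" not in head:
            suspects.append((filename, "no %PDF- header in first 1024 bytes"))
    return suspects
```

Calling find_suspect_pdfs("/testfiles") and printing each pair should show whether the failures come from a handful of odd files rather than from poppler itself.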