Merging PDF files using Python and PyPDF2 throws a TypeError

Question

I am using Python 3.6.5 to merge PDFs together but am running into a problem. The code below throws a 'TypeError: 'NumberObject' object is not subscriptable' error. What am I doing wrong? When I comment out the line with the merger.append, it prints out the file paths correctly.

import webbrowser
import os
from PyPDF2 import PdfFileMerger, PdfFileReader

path = 'C:/test/pdfs'
merger = PdfFileMerger()
for pdf in os.listdir(path):
    merger.append(PdfFileReader(open(os.path.join(path,pdf), 'rb')))
    print(os.path.join(path,pdf))
merger.write(path+'/merged.pdf')
merger.close()
webbrowser.open_new(path+'/merged.pdf')

  File "C:\test\pdftest.py", line 9, in <module>
    merger.append(PdfFileReader(open(os.path.join(path,pdf), 'rb')))
  File "C:\python\lib\site-packages\pypdf2-1.26.0-py3.6.egg\PyPDF2\pdf.py", line 1084, in __init__
    self.read(stream)
  File "C:\python\lib\site-packages\pypdf2-1.26.0-py3.6.egg\PyPDF2\pdf.py", line 1805, in read
    assert xrefstream["/Type"] == "/XRef"
TypeError: 'NumberObject' object is not subscriptable

When I change the merger.append to take a file path, I get:

  File "C:\test\pdftest.py", line 9, in <module>
    merger.append(os.path.join(path,pdf))
  File "C:\python\lib\site-packages\pypdf2-1.26.0-py3.6.egg\PyPDF2\merger.py", line 203, in append
    self.merge(len(self.pages), fileobj, bookmark, pages, import_bookmarks)
  File "C:\python\lib\site-packages\pypdf2-1.26.0-py3.6.egg\PyPDF2\merger.py", line 133, in merge
    pdfr = PdfFileReader(fileobj, strict=self.strict)
  File "C:\python\lib\site-packages\pypdf2-1.26.0-py3.6.egg\PyPDF2\pdf.py", line 1084, in __init__
    self.read(stream)
  File "C:\python\lib\site-packages\pypdf2-1.26.0-py3.6.egg\PyPDF2\pdf.py", line 1805, in read
    assert xrefstream["/Type"] == "/XRef"
TypeError: 'NumberObject' object is not subscriptable

UPDATE: It looks like one of the PDFs in the folder was causing this. The only thing different with that PDF is that it uses Type 1 font whereas the other PDFs use TrueType font. Does anyone know a workaround or fix for this?

File "C:\test\pdftest.py", line 9, in <module> merger.append(PdfFileReader(open(os.path.join(path,pdf), 'rb'))) File "C:\python\lib\site-packages\pypdf2-1.26.0-py3.6.egg\PyPDF2\pdf.py", line 1084, in init self.read(stream) File "C:\python\lib\site-packages\pypdf2-1.26.0-py3.6.egg\PyPDF2\pdf.py", line 1805, in read assert xrefstream["/Type"] == "/XRef" TypeError: 'NumberObject' object is not subscriptable — krazyboi
The documentation says that PdfFileMerger.append takes a file object or a pathname, not a PdfFileReader. — Dan D.
Some of the files in path are not files and are not PDF files. You need to filter those out from the result of os.listdir(path). — Dan D.
@DanD. I've updated the post to show the traceback when I change PdfFileMerger.append to take a pathname. Also, the files in path are all PDF files. I created a new folder and placed the PDFs in there manually. — krazyboi

LamerLink · Accepted Answer · 2020-12-31T17:09:13

This seems to be caused by either unrecognised or bad PDF formatting. I'm no PDF expert but it seems PyPDF2 is complaining about a record in the XRef table. I've found the easiest way to get around this is to reformat the PDF.

What I do is wrap merger.append(PdfFileReader(file)) in a try block. If the exception message contains 'NumberObject' object is not subscriptable, I "convert" the PDF with LibreOffice in headless mode via subprocess:

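import subprocess
# dest_file_path (output folder) and file_name (the problem PDF) are set elsewhere
command = [r'"C:\Program Files\LibreOffice\program\soffice.bin"',
           '--convert-to', 'pdf', '--outdir', f'"{dest_file_path}"', f'"{file_name}"']
pdf_convert = subprocess.Popen(' '.join(command))
pdf_convert.wait()  # let the conversion finish before merging the result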

A note on using LibreOffice and subprocess: for whatever reason, passing the command as a list causes an access-denied error for me on Windows, which is why I join it into a single string instead.
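
Putting the pieces together, here is a minimal sketch of the try-and-fall-back approach, not a definitive implementation. The helper names (list_pdfs, convert_with_libreoffice, append_with_fallback) are made up for illustration. The soffice.bin path is the one from the command above and may differ on your machine. It is also an assumption that LibreOffice writes the converted file into the output folder under the input's base name. The command is passed as a single string, as described above, so this form is Windows-only.

```python
import os
import subprocess

# Assumed install location (taken from the command above); adjust as needed.
SOFFICE = r"C:\Program Files\LibreOffice\program\soffice.bin"

# The message PyPDF2 raised in the tracebacks from the question.
ERROR_TEXT = "'NumberObject' object is not subscriptable"


def list_pdfs(folder):
    """Return full paths of the regular files in folder that end in .pdf."""
    paths = (os.path.join(folder, name) for name in sorted(os.listdir(folder)))
    return [p for p in paths if os.path.isfile(p) and p.lower().endswith(".pdf")]


def convert_with_libreoffice(pdf_path, out_dir):
    """Re-export pdf_path into out_dir and return the expected output path.

    out_dir should differ from the source folder so the original is not
    overwritten. The returned name is an assumption about LibreOffice's
    output naming, not something this function verifies.
    """
    # Single command string rather than a list, following the note above.
    command = f'"{SOFFICE}" --convert-to pdf --outdir "{out_dir}" "{pdf_path}"'
    subprocess.run(command, check=True)  # blocks until the conversion ends
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    return os.path.join(out_dir, stem + ".pdf")


def append_with_fallback(merger, pdf_path, convert):
    """Append pdf_path to merger; on the known TypeError, append a converted copy.

    merger is any object with an append(path) method, such as a PdfFileMerger.
    convert is a callable that takes a path and returns the converted path.
    Any other TypeError is re-raised unchanged.
    """
    try:
        merger.append(pdf_path)
    except TypeError as exc:
        if ERROR_TEXT not in str(exc):
            raise
        merger.append(convert(pdf_path))
```

To use it, create the PdfFileMerger as in the question, loop over list_pdfs(path), and call append_with_fallback(merger, pdf, lambda p: convert_with_libreoffice(p, some_temp_dir)) for each file before merger.write(...). Passing the converter in as a callable keeps the retry logic independent of LibreOffice, so another repair tool can be swapped in.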
