Use PyPDF2 to detect non-embedded fonts in a PDF file generated by Google Docs

Question

I was hoping someone could help me write a Python function to detect any fonts used in a PDF file that are not embedded in it. I've attempted to use the script linked here; it detects the document's fonts, but it does not detect which fonts are embedded. I've pasted the script below for convenience:

from PyPDF2 import PdfFileReader
from pprint import pprint
import sys

fontkeys = set(['/FontFile', '/FontFile2', '/FontFile3'])

def walk(obj, fnt, emb):
    if '/BaseFont' in obj:
        fnt.add(obj['/BaseFont'])

    elif '/FontName' in obj and fontkeys.intersection(set(obj)):
        emb.add(obj['/FontName'])

    for k in obj:
        if hasattr(obj[k], 'keys'):
            walk(obj[k], fnt, emb)

    return fnt, emb

if __name__ == '__main__':
    fname = sys.argv[1]
    pdf = PdfFileReader(fname)
    fonts = set()
    embedded = set()

    for page in pdf.pages:
        obj = page.getObject()
        f, e = walk(obj['/Resources'], fonts, embedded)
        fonts = fonts.union(f)
        embedded = embedded.union(e)

    unembedded = fonts - embedded
    print('Font List')
    pprint(sorted(list(fonts)))
    if unembedded:
        print('\nUnembedded Fonts')
        pprint(unembedded)

For example, I've downloaded a PDF from Google Docs (type some stuff, save as PDF) with the Arial font, and Adobe Reader has confirmed that the font is embedded. However, the script returns ['/ArialMT'] as a font, and an empty set for embedded fonts. Additionally, it does not look like any of the recursive objects have the keys {'/FontFile', '/FontFile2', '/FontFile3'}. I've tried it on other PDFs and it works, so it must be something weird with the Google Docs PDFs. Let me know what other debug information I can give for this PDF file.

One thought was that Google Docs might only embed fonts that are not among the 14 standard PDF fonts. However, I tried it with an unusual font (Pacifico), and the script also reported this font as not embedded, even though Adobe Reader says it is.

I tried it with this PDF, and the script correctly stated that these 14 fonts were embedded.

Arya Arya · Accepted Answer · 2020-03-27T23:11:21

The issue is that the script does not handle lists. In the Google Docs PDF, for example, the font object has this structure:

{'/Encoding': '/Identity-H', '/Type': '/Font', '/BaseFont': '/Pacifico-Regular', '/ToUnicode': IndirectObject(9, 0), '/DescendantFonts': [IndirectObject(16, 0)], '/Subtype': '/Type0'}

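The key /DescendantFonts maps to a list of values. If you recurse into the elements of that list, you reach the dictionaries that contain the font-file keys. Because walk only follows values that have a keys attribute, it never descends into the list. You have to modify the script to test for arrays as well, for example: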

# Needs `import PyPDF2`; fnt and emb are the sets passed to walk()
if isinstance(obj, PyPDF2.generic.ArrayObject):  # You can also do duck typing here
    for i in obj:
        if hasattr(i, 'keys'):
            walk(i, fnt, emb)
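
Putting the pieces together, here is a minimal sketch of a walk that descends into both dictionaries and arrays. It uses duck typing instead of PyPDF2 types, so it also runs on plain dicts and lists. The resolve helper is my own addition and not part of the original script: it assumes that indirect references expose get_object() (newer PyPDF2 / pypdf) or getObject() (PyPDF2 1.x), and it leaves any other object untouched.

```python
FONT_FILE_KEYS = {'/FontFile', '/FontFile2', '/FontFile3'}


def resolve(obj):
    """Follow an indirect reference if the object offers a resolver method.

    Assumption: the method is called get_object (newer releases) or
    getObject (PyPDF2 1.x). Plain Python objects are returned unchanged.
    """
    for name in ('get_object', 'getObject'):
        method = getattr(obj, name, None)
        if callable(method):
            return method()
    return obj


def walk(obj, fonts, embedded):
    """Collect used font names in `fonts` and embedded ones in `embedded`."""
    obj = resolve(obj)
    if hasattr(obj, 'keys'):
        # Dictionary-like: same checks as the original script.
        if '/BaseFont' in obj:
            fonts.add(obj['/BaseFont'])
        elif '/FontName' in obj and FONT_FILE_KEYS.intersection(obj):
            embedded.add(obj['/FontName'])
        for key in obj:
            walk(obj[key], fonts, embedded)
    elif isinstance(obj, (list, tuple)):
        # Array-like (e.g. /DescendantFonts): visit every element.
        for item in obj:
            walk(item, fonts, embedded)
    return fonts, embedded
```

With this version, the main loop from the question can stay as it is: calling walk(obj['/Resources'], fonts, embedded) for each page should now reach the font descriptor behind /DescendantFonts, so fonts - embedded only lists fonts that have no font-file entry.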
