
I am using Ghostscript to merge PDF files. Occasionally the embedded font names collide across files: Ghostscript picks one subset, and characters that exist only in the other subsets of the same name cannot be rendered after merging.

To solve the problem, I'd like to add a preprocessing phase that renames the embedded fonts in each file, with the new names generated by my own logic.

Solutions under Linux are preferred.

P.S. I have evaluated other tools for merging PDFs (pdfbox, pdfjam, pdftk, pdfunite, qpdf), but none of them appears to deduplicate identical images, so the merged PDF is large. Ghostscript keeps only one object for images that are exactly identical across the input files, which fits my scenario.


Update after reading reply from @KenS

GhostScript version: 9.18

PDF creator:

  • xelatex: XeTeX 3.14159265-2.6-0.99998 (TeX Live 2017)
  • xdvipdfmx: Version 20170318 by the DVIPDFMx project team, modified for TeX Live.

The output when merging two PDFs with colliding font names:

$ gs -q -dSAFER -dBATCH -dNOPAUSE -dPDFSETTINGS=/prepress -sDEVICE=pdfwrite -sOutputFile=merged.pdf 1.pdf 2.pdf
GPL Ghostscript 9.18: Missing glyph CID=120, glyph=0078 in the font BLTQUA+LMRoman9-Regular . The output PDF may fail with some viewers.
GPL Ghostscript 9.18: Missing glyph CID=117, glyph=0075 in the font BLTQUA+LMRoman9-Regular . The output PDF may fail with some viewers.
GPL Ghostscript 9.18: Missing glyph CID=118, glyph=0076 in the font BLTQUA+LMRoman9-Regular . The output PDF may fail with some viewers.
GPL Ghostscript 9.18: Missing glyph CID=116, glyph=0074 in the font BLTQUA+LMRoman9-Regular . The output PDF may fail with some viewers.

Embedded fonts:

$ pdffonts 1.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
ITLHBL+LMRoman10-Regular-Identity-H  CID Type 0C       Identity-H       yes yes yes      7  0
BLTQUA+LMRoman9-Regular-Identity-H   CID Type 0C       Identity-H       yes yes yes      9  0
MHRCBY+LMRoman8-Regular-Identity-H   CID Type 0C       Identity-H       yes yes yes     12  0

$ pdffonts 2.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
ITLHBL+LMRoman10-Regular-Identity-H  CID Type 0C       Identity-H       yes yes yes      7  0
BLTQUA+LMRoman9-Regular-Identity-H   CID Type 0C       Identity-H       yes yes yes      9  0
MHRCBY+LMRoman8-Regular-Identity-H   CID Type 0C       Identity-H       yes yes yes     12  0

The font names are exactly the same. Because I use xelatex to generate the PDFs programmatically from a common pattern, the object IDs of the fonts are also exactly the same. Ghostscript therefore treats the BLTQUA+LMRoman9-Regular fonts from the two files as the same subset and complains at processing time.
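
The collision condition described here (same font name and same object ID in both files) can be checked before merging. Below is a minimal sketch that parses `pdffonts` output in the layout shown above. The function names are my own, and it assumes font names contain no spaces and that the object number is the second-to-last column.

```python
import subprocess


def run_pdffonts(path):
    """Run pdffonts (from poppler-utils) on a file and return its text output."""
    result = subprocess.run(
        ["pdffonts", str(path)], capture_output=True, text=True, check=True
    )
    return result.stdout


def parse_pdffonts(text):
    """Return a set of (font name, object number) pairs from pdffonts output."""
    fonts = set()
    in_body = False
    for line in text.splitlines():
        if line.startswith("---"):
            # The dashed line separates the column titles from the font rows.
            in_body = True
            continue
        if not in_body:
            continue
        parts = line.split()
        # The last two fields are the object number and the generation number.
        if len(parts) < 2 or not parts[-2].isdigit():
            continue
        fonts.add((parts[0], int(parts[-2])))
    return fonts


def colliding_fonts(output_a, output_b):
    """Fonts that share both name and object number in the two listings."""
    return parse_pdffonts(output_a) & parse_pdffonts(output_b)
```

For example, `colliding_fonts(run_pdffonts("1.pdf"), run_pdffonts("2.pdf"))` would list all three fonts for the files above. A shared name and object number does not by itself prove that the subsets differ, so this is only a hint that a merge may go wrong.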

As @KenS suggested, I let Ghostscript generate a new file for each PDF.

Ghostscript will calculate a prefix using the MD5 sum of the font contents.

Then check fonts:

$ pdffonts preproc_1.pdf 
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
JUVZAM+LMRoman8-Regular              CID Type 0C       Identity-H       yes yes yes     22  0
DCQLFZ+LMRoman9-Regular              CID Type 0C       Identity-H       yes yes yes     17  0
YAKIEH+LMRoman10-Regular             CID Type 0C       Identity-H       yes yes yes     13  0

$ pdffonts preproc_2.pdf 
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
JUVZAM+LMRoman8-Regular              CID Type 0C       Identity-H       yes yes yes     22  0
EQFACS+LMRoman9-Regular              CID Type 0C       Identity-H       yes yes yes     17  0
YAKIEH+LMRoman10-Regular             CID Type 0C       Identity-H       yes yes yes     13  0

Now it is obvious that the LMRoman9-Regular fonts are different subsets (though they still have the same object ID), so this no longer confuses Ghostscript.


1 Answer


[insert usual disclaimer about the fact that Ghostscript does not merge PDF files]

Note that this is really only a problem when the creating application does a poor job of selecting the prefix for the embedded font name. Realistically the fault lies with the PDF creator.

You haven't stated which version of Ghostscript you are using. Recent versions of Ghostscript use both the font name and the PDF object number to try and give a greater degree of uniqueness. So the fonts will only collide if the name and object number in the two PDF files are the same, which is less likely.

If that's still a problem, a practical solution is to pass each of the original PDF files through Ghostscript and the pdfwrite device, producing one new PDF file per input. When creating the fonts in the new PDF files, Ghostscript will calculate a prefix using the MD5 sum of the font contents. While not absolutely unbreakable, the chance of two different subsets having contents that produce the same MD5 hash is very low.

You can then safely process the newly created PDF files with no real risk that different fonts will have the same name and object number.
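
The two-pass approach can be sketched roughly as follows. The Ghostscript flags are copied from the command in the question, and the `preproc_` output naming follows the question's update. The function names and the output directory are assumptions of mine, not anything Ghostscript prescribes.

```python
import subprocess
from pathlib import Path

# Flags taken from the merge command shown in the question.
GS_BASE = [
    "gs", "-q", "-dSAFER", "-dBATCH", "-dNOPAUSE",
    "-dPDFSETTINGS=/prepress", "-sDEVICE=pdfwrite",
]


def pdfwrite_command(inputs, output):
    """Build a gs command that writes the given inputs to one output PDF."""
    return GS_BASE + [f"-sOutputFile={output}"] + [str(p) for p in inputs]


def preprocess(paths, out_dir):
    """First pass: rewrite each input on its own so pdfwrite re-creates the fonts."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for src in paths:
        dst = out_dir / f"preproc_{Path(src).name}"
        subprocess.run(pdfwrite_command([src], dst), check=True)
        outputs.append(dst)
    return outputs


def merge(paths, merged):
    """Second pass: feed the rewritten files to pdfwrite together."""
    subprocess.run(pdfwrite_command(paths, merged), check=True)
```

A call such as `merge(preprocess(["1.pdf", "2.pdf"], "tmp"), "merged.pdf")` would run both passes. Each file goes through pdfwrite twice, so this trades processing time for safer font names.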

If you insist on doing the renaming yourself, you might be able to get away with just looking through the PDF file for names of the form XXXXXX+FontName. You could modify the 6-letter prefix and rewrite the file.
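
A rough sketch of that idea, operating on the raw bytes of the file. It assumes the subset tag is six uppercase letters, as in the `pdffonts` listings in the question, and it replaces each tag with one of the same length so that byte offsets do not move. The regular expression can also match by coincidence inside binary stream data, and the name stored inside the embedded font program is left untouched, so treat this as an experiment rather than a safe tool. `prefix_from_salt` is a hypothetical stand-in for your own naming logic.

```python
import hashlib
import re

# A PDF name that starts with a six-letter subset tag followed by "+".
SUBSET_NAME = re.compile(rb"/([A-Z]{6})\+([^\s/\[\]()<>{}%]+)")


def prefix_from_salt(old_prefix, salt):
    """Hypothetical naming logic: derive six uppercase letters from an MD5 digest."""
    digest = hashlib.md5(salt + old_prefix).digest()
    return bytes(ord("A") + b % 26 for b in digest[:6])


def rename_subset_prefixes(data, new_prefix_for):
    """Replace every subset tag in data with new_prefix_for(old_tag)."""
    def repl(match):
        new = new_prefix_for(match.group(1))
        # Same length and same alphabet, so the file layout is unchanged.
        if len(new) != 6 or not new.isalpha() or not new.isupper():
            raise ValueError("prefix must be six uppercase ASCII letters")
        return b"/" + new + b"+" + match.group(2)
    return SUBSET_NAME.sub(repl, data)
```

Using the file name as the salt, for example `rename_subset_prefixes(data, lambda old: prefix_from_salt(old, b"1.pdf"))`, gives each input file its own set of prefixes while keeping all occurrences of one font within a file consistent.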

I can't recall offhand whether font objects can be stored in compressed object streams. If they can, that would significantly increase the difficulty, because you would have to decompress the stream, modify the data, recompress it and, most likely, modify the xref table, because it's unlikely the recompressed stream would be the same length as the original.
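
Whether a given file uses object streams at all can be checked with a simple heuristic before attempting a byte-level rename. The sketch below only looks for an `/ObjStm` type entry in the raw bytes. It assumes the stream dictionaries themselves are stored uncompressed, and it does not say which objects live inside the streams.

```python
import re

# Object streams are marked by a dictionary entry "/Type /ObjStm".
OBJSTM = re.compile(rb"/Type\s*/ObjStm\b")


def uses_object_streams(data):
    """Return True if the raw PDF bytes appear to contain an object stream."""
    return OBJSTM.search(data) is not None
```

If this returns True for a file, a plain search for font names in the raw bytes may miss names that are stored in compressed form.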