I am using Ghostscript to merge PDF files. Occasionally, embedded font subsets in different files share the same name; Ghostscript then keeps only one subset, so characters that exist only in the other same-named subsets cannot be rendered after merging.
To solve the problem, I'd like to add a preprocessing step that renames the embedded fonts in each file, with the new names generated by my own logic.
Solutions under Linux are preferred.
P.S. I have evaluated other tools for merging PDFs (pdfbox, pdfjam, pdftk, pdfunite, qpdf), but none of them appears to deduplicate identical images, so the merged PDF is large. Ghostscript keeps only one object for images that are exactly the same across multiple input files, which fits my scenario.
Update after reading reply from @KenS
Ghostscript version: 9.18
PDF creator:
- xelatex: XeTeX 3.14159265-2.6-0.99998 (TeX Live 2017)
- xdvipdfmx: Version 20170318 by the DVIPDFMx project team, modified for TeX Live.
The output when merging 2 PDFs with colliding font names:
$ gs -q -dSAFER -dBATCH -dNOPAUSE -dPDFSETTINGS=/prepress -sDEVICE=pdfwrite -sOutputFile=merged.pdf 1.pdf 2.pdf
GPL Ghostscript 9.18: Missing glyph CID=120, glyph=0078 in the font BLTQUA+LMRoman9-Regular . The output PDF may fail with some viewers.
GPL Ghostscript 9.18: Missing glyph CID=117, glyph=0075 in the font BLTQUA+LMRoman9-Regular . The output PDF may fail with some viewers.
GPL Ghostscript 9.18: Missing glyph CID=118, glyph=0076 in the font BLTQUA+LMRoman9-Regular . The output PDF may fail with some viewers.
GPL Ghostscript 9.18: Missing glyph CID=116, glyph=0074 in the font BLTQUA+LMRoman9-Regular . The output PDF may fail with some viewers.
Embedded fonts:
$ pdffonts 1.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
ITLHBL+LMRoman10-Regular-Identity-H  CID Type 0C       Identity-H       yes yes yes      7  0
BLTQUA+LMRoman9-Regular-Identity-H   CID Type 0C       Identity-H       yes yes yes      9  0
MHRCBY+LMRoman8-Regular-Identity-H   CID Type 0C       Identity-H       yes yes yes     12  0
$ pdffonts 2.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
ITLHBL+LMRoman10-Regular-Identity-H  CID Type 0C       Identity-H       yes yes yes      7  0
BLTQUA+LMRoman9-Regular-Identity-H   CID Type 0C       Identity-H       yes yes yes      9  0
MHRCBY+LMRoman8-Regular-Identity-H   CID Type 0C       Identity-H       yes yes yes     12  0
The font names are exactly the same. Because I use xelatex to programmatically generate PDFs from a fixed pattern, the object IDs of the fonts are exactly the same too. Ghostscript therefore considers the BLTQUA+LMRoman9-Regular fonts from the 2 files to be the same subset, and complains at processing time.
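A collision like this can be spotted before merging by comparing the name columns of two pdffonts listings. The following is only a rough sketch: the helper name colliding_fonts and the .fonts file names are made up for illustration, and it assumes font names contain no spaces.

```shell
# Print font names that occur in both of two saved pdffonts listings.
# Usage (hypothetical file names):
#   pdffonts 1.pdf > 1.fonts; pdffonts 2.pdf > 2.fonts
#   colliding_fonts 1.fonts 2.fonts
colliding_fonts() {
    awk '
        FNR <= 2 || NF == 0 { next }      # skip the header, the dashed rule, blank lines
        NR == FNR { seen[$1] = 1; next }  # first listing: remember each font name
        $1 in seen { print $1 }           # second listing: report names seen before
    ' "$1" "$2" | sort -u
}
```

An empty result only means the names differ; identical names do not by themselves prove that the subsets have different contents.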
As @KenS suggested, I let Ghostscript generate a new file for each PDF.
Ghostscript calculates a new subset prefix from the MD5 sum of the font contents.
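The preprocessing pass could look roughly like the sketch below. The flags mirror the merge command above minus -dPDFSETTINGS, which is my assumption of a minimal invocation; the preproc_ output prefix and the GS override variable are only illustrative.

```shell
# Run each input PDF through pdfwrite on its own, so that every file's
# font subsets are re-emitted with prefixes computed by Ghostscript.
# GS can be overridden (e.g. GS=echo) to print the commands instead of running them.
preprocess_pdfs() {
    for f in "$@"; do
        out="preproc_$(basename "$f")"   # assumed naming scheme, written to the current directory
        "${GS:-gs}" -q -dSAFER -dBATCH -dNOPAUSE \
            -sDEVICE=pdfwrite -sOutputFile="$out" "$f" || return 1
    done
}
```

Calling preprocess_pdfs 1.pdf 2.pdf would then produce preproc_1.pdf and preproc_2.pdf, one Ghostscript run per file.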
Then check fonts:
$ pdffonts preproc_1.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
JUVZAM+LMRoman8-Regular              CID Type 0C       Identity-H       yes yes yes     22  0
DCQLFZ+LMRoman9-Regular              CID Type 0C       Identity-H       yes yes yes     17  0
YAKIEH+LMRoman10-Regular             CID Type 0C       Identity-H       yes yes yes     13  0
$ pdffonts preproc_2.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
JUVZAM+LMRoman8-Regular              CID Type 0C       Identity-H       yes yes yes     22  0
EQFACS+LMRoman9-Regular              CID Type 0C       Identity-H       yes yes yes     17  0
YAKIEH+LMRoman10-Regular             CID Type 0C       Identity-H       yes yes yes     13  0
Now it is obvious that the two LMRoman9-Regular fonts are not the same subset (though they still have the same object ID), so they no longer confuse Ghostscript.
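To confirm that the final merge is clean, the Ghostscript log can be scanned for the warning shown earlier. This is just a sketch: count_missing_glyphs is a made-up helper name, and it assumes the warning text stays the same as in 9.18.

```shell
# Count "Missing glyph" warnings in a Ghostscript log read from stdin.
# Usage, with the preprocessed files as input:
#   gs -q -dSAFER -dBATCH -dNOPAUSE -dPDFSETTINGS=/prepress -sDEVICE=pdfwrite \
#      -sOutputFile=merged.pdf preproc_1.pdf preproc_2.pdf 2>&1 | count_missing_glyphs
count_missing_glyphs() {
    # grep -c prints 0 but exits non-zero when nothing matches; keep the exit status clean
    grep -c 'Missing glyph' || true
}
```

The 2>&1 is there so the count covers the warnings whether they are written to stdout or stderr.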