2
votes

I have a large print file in PDF format that contains 5544 pages and is about 36 MB in size. The file was created by MS Word 2010 and contains only text and a logo on each letter/document.

I split it into 5544 files and merge them back into 2770 letters, based on keywords. Each letter is approx. 140-145 KB.

When I merge all the letters into a new PDF print file, still containing 5544 pages, the file size has grown to 396 MB.

All text extraction, splitting and merging is performed with calls to the Apache PDFBox command-line tools from PHP, but the result is the same when run from a console.

Any idea how to reduce the file size of the letters and the final print file? It seems like PDFBox has just appended each letter to the final print file instead of creating a new PDF document.

It's only in the testing phase that all the documents are merged into the final print file; some of the documents will be sent by email.

I have also tried SAMBox (a fork of PDFBox) but with nearly the same result:

pdfinfo Original.pdf

    Title:          Printfile
    Author:         Claus Hjort Bube
    Creator:        Microsoft® Word 2010
    Producer:       Microsoft® Word 2010
    CreationDate:   Fri May 19 12:16:34 2017 CEST
    ModDate:        Fri May 19 12:16:34 2017 CEST
    Tagged:         yes
    UserProperties: no
    Suspects:       no
    Form:           none
    JavaScript:     no
    Pages:          5544
    Encrypted:      no
    Page size:      595.32 x 841.92 pts (A4)
    Page rot:       0
    File size:      36092281 bytes
    Optimized:      no
    PDF version:    1.5

pdfinfo PDFBox.pdf

    Title:          Printfile
    Author:         Claus Hjort Bube
    Creator:        Microsoft® Word 2010
    Producer:       Microsoft® Word 2010
    CreationDate:   Fri May 19 12:16:34 2017 CEST
    ModDate:        Fri May 19 12:16:34 2017 CEST
    Tagged:         no
    UserProperties: no
    Suspects:       no
    Form:           none
    JavaScript:     no
    Pages:          5544
    Encrypted:      no
    Page size:      595.32 x 841.92 pts (A4)
    Page rot:       0
    File size:      396622354 bytes
    Optimized:      no
    PDF version:    1.4

pdfinfo SAMBox.pdf

    Creator:        Sejda Console 3.2.17
    Producer:       SAMBox 1.1.8 (www.sejda.org)
    ModDate:        Tue Jul 11 23:34:33 2017 CEST
    Tagged:         no
    UserProperties: no
    Suspects:       no
    Form:           none
    JavaScript:     no
    Pages:          5544
    Encrypted:      no
    Page size:      595.32 x 841.92 pts (A4)
    Page rot:       0
    File size:      378779436 bytes
    Optimized:      no
    PDF version:    1.7

1
This indeed can happen. In the original file, resources were most likely shared, e.g. there was just one copy of the image that all pages referred to. When the file was split, each partial PDF got its own copy of each shared resource. So after merging those partial PDFs again, each page has its own copy of each formerly shared resource. This makes the file size explode. Unfortunately PDFBox does not (yet?) have a smart merger to recognize identical resources and reduce them to a single copy. - mkl
Thus, instead of splitting the original PDF and putting a selection of the split pages together again, you should always start from the original file and reduce it by removing all unwanted pages. - mkl
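
To make the effect described in these comments concrete, here is a minimal toy model in plain Java. It is not PDFBox code and says nothing about PDFBox internals; the class names and the byte sizes are made up for illustration. It only assumes that a writer stores each distinct resource object once, and that splitting hands every page its own private copy.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only (hypothetical names): why copied resources inflate a merged file. */
class SharedResourceModel {

    /** A resource such as an embedded font or a logo image. */
    static final class Resource {
        final long size;
        Resource(long size) { this.size = size; }
        Resource copy() { return new Resource(size); }
    }

    /** A page has its own content plus references to resources. */
    static final class Page {
        final long contentSize;
        final List<Resource> resources;
        Page(long contentSize, List<Resource> resources) {
            this.contentSize = contentSize;
            this.resources = resources;
        }
        /** Simulates splitting: the page leaves with private copies of its resources. */
        Page detach() {
            List<Resource> copies = new ArrayList<>();
            for (Resource r : resources) {
                copies.add(r.copy());
            }
            return new Page(contentSize, copies);
        }
    }

    /** File size = all page contents + every distinct resource object counted once. */
    static long fileSize(List<Page> pages) {
        Map<Resource, Boolean> seen = new IdentityHashMap<>();
        long total = 0;
        for (Page p : pages) {
            total += p.contentSize;
            for (Resource r : p.resources) {
                if (seen.put(r, Boolean.TRUE) == null) {
                    total += r.size; // first time this exact object is written
                }
            }
        }
        return total;
    }

    /** Builds a document in which all pages refer to one shared resource. */
    static List<Page> original(int pageCount, long contentSize, long resourceSize) {
        Resource shared = new Resource(resourceSize);
        List<Page> pages = new ArrayList<>();
        for (int i = 0; i < pageCount; i++) {
            pages.add(new Page(contentSize, Collections.singletonList(shared)));
        }
        return pages;
    }

    /** Split into single pages, then merge without recognizing identical resources. */
    static List<Page> splitAndMerge(List<Page> pages) {
        List<Page> merged = new ArrayList<>();
        for (Page p : pages) {
            merged.add(p.detach());
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Page> orig = original(1000, 5_000, 70_000); // made-up sizes
        System.out.println("original:       " + fileSize(orig));
        System.out.println("split + merged: " + fileSize(splitAndMerge(orig)));
    }
}
```

With these made-up numbers the shared version counts the 70,000-byte resource once, while the split-and-merged version counts it once per page, which is the kind of blow-up seen in the question.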

1 Answer

0
votes

That may sound sad, but it is correct. When splitting, each file gets the resources it needs (e.g. fonts and the company logo graphic). When the files are merged back, PDFBox does not know that these resources may be identical across the whole document, so each of them ends up duplicated many times.
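
A quick back-of-the-envelope check with the numbers from the question supports this. The sketch below only divides the file sizes reported by pdfinfo by the page and letter counts; the class and method names are made up for illustration.

```java
/** Back-of-the-envelope check using the sizes and counts reported in the question. */
class SizeCheck {
    static final long ORIGINAL_BYTES = 36_092_281L;  // pdfinfo Original.pdf
    static final long MERGED_BYTES = 396_622_354L;   // pdfinfo PDFBox.pdf
    static final int PAGES = 5544;
    static final int LETTERS = 2770;

    /** Average bytes per page in the original, where resources are presumably shared. */
    static long originalBytesPerPage() {
        return ORIGINAL_BYTES / PAGES;
    }

    /** Average bytes per letter in the merged print file. */
    static long mergedBytesPerLetter() {
        return MERGED_BYTES / LETTERS;
    }

    /** How many times larger the merged file is than the original. */
    static double growthFactor() {
        return (double) MERGED_BYTES / ORIGINAL_BYTES;
    }

    public static void main(String[] args) {
        System.out.println("original bytes per page: " + originalBytesPerPage());
        System.out.println("merged bytes per letter: " + mergedBytesPerLetter());
        System.out.println("growth factor:           " + growthFactor());
    }
}
```

The merged file averages 143184 bytes per letter, which falls in the 140-145 KB range reported for a single letter, while the original averages only 6510 bytes per page. In other words, the merged file is essentially the sum of the individual letters (roughly 11 times the original), as one would expect if every letter carries its own copy of the fonts and logo.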

The only solution I see for you would be to use the PDFBox Java API to create the mailing files and the final print file in one step, i.e. without creating single-page files that are merged back.
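
As a rough sketch of the planning half of such a one-step approach, the plain Java below works out which page ranges of the original belong to which letter. It rests on assumptions not stated in the question: that the text of every page has already been extracted, and that a keyword (here the hypothetical "Dear customer") appears on the first page of each letter. `LetterPlanner`, `Range` and `plan` are made-up names. The actual page copying is left out; with PDFBox 2.x it could probably be done by loading the original once and copying each range into a new document, for example with `PDDocument.importPage`, but that part is untested here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: groups the pages of the original file into letters. */
class LetterPlanner {

    /** Inclusive, zero-based page range of one letter in the original file. */
    static final class Range {
        final int first;
        final int last;
        Range(int first, int last) {
            this.first = first;
            this.last = last;
        }
        @Override
        public String toString() {
            return first + "-" + last;
        }
    }

    /**
     * Starts a new letter on every page whose text contains startKeyword.
     * Pages before the first keyword hit are ignored.
     */
    static List<Range> plan(List<String> pageTexts, String startKeyword) {
        List<Range> letters = new ArrayList<>();
        int start = -1; // index of the first page of the letter being collected
        for (int i = 0; i < pageTexts.size(); i++) {
            if (pageTexts.get(i).contains(startKeyword)) {
                if (start >= 0) {
                    letters.add(new Range(start, i - 1)); // close the previous letter
                }
                start = i;
            }
        }
        if (start >= 0) {
            letters.add(new Range(start, pageTexts.size() - 1)); // close the last letter
        }
        return letters;
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList(
                "Dear customer, ...", "page 2 of the letter",
                "Dear customer, ...",
                "Dear customer, ...", "page 2 of the letter");
        System.out.println(plan(pages, "Dear customer")); // prints [0-1, 2-2, 3-4]
    }
}
```

Each range can then be written out as one mailing file directly from the original, and the print file can be produced from the same source, so no intermediate single-page files ever exist.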