I'm trying to dynamically generate PDFs from user input, where I basically print the user input and overlay it on an existing PDF that I did not create.

It works, with one major exception: Adobe Reader doesn't read it properly, on Windows or on Linux. QuickOffice on my phone doesn't read it either. So I thought I'd trace the steps I take to create the files:

1 - Original PDF of background
PDF 1.2 made with Adobe Distiller with the LZW encoding. I didn't make this.

2 - PDF of background
PDF 1.4 made with Ghostscript. I used pdf2ps then ps2pdf on the above to strip LZW so that the reportlab and pyPDF libraries would recognize it. Note that this file looks "fuzzy," like a bad scan, in Adobe Reader, but looks fine in other readers.

3 - PDF of user-input text formatted to be combined with background
PDF 1.3 made with Reportlab from user input. Opens properly and looks good in every reader I've tried.

4 - Finished PDF
PDF 1.3 made from PyPDF's mergePage() function on 2 and 3.

Does not open in:
Adobe Reader for Windows
Adobe Reader for Linux
QuickOffice for Android

Opens perfectly in:
Google Docs' PDF viewer on the web
evince for linux
ghostscript viewer for linux
Foxit reader for Windows
Preview for Mac

Are there known issues that I should know about? I don't know exactly what "flate" is, but from the internet I gather that it's some sort of open source alternative to LZW for PDF compression? Could that be causing my problem? If so, are there any libraries I could use to fix the cause in my code?

1 Answer

First remark:

Your 2nd step has many, many drawbacks. If you convert PDF to PostScript and then back to PDF, you are going to lose quality. This process is called "re-frying" PDFs, and it is generally frowned upon by PDF professionals. (The reasons: resulting files may look "fuzzy", like bad scans; files may have lost their embedded fonts; original fonts may have been replaced; transparencies are certainly lost; images have changed resolutions; colors have changed...)

Sometimes you have no other choice than "re-frying"... but here you DO.

If you use Ghostscript, you can do a direct PDF-to-PDF conversion of PDF files, and there will be no internal, hidden PostScript conversion happening. (This feature of Ghostscript is not widely known, and therefore this answer would normally deserve lots of upvotes ;-P ).
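
A minimal sketch of such a direct conversion, driven from Python since the rest of the pipeline in the question is Python-based. The function names `gs_pdf_to_pdf_cmd` and `convert_pdf` and the file names are hypothetical; `-o`, `-sDEVICE=pdfwrite` and `-dCompatibilityLevel` are standard Ghostscript options. The executable name is an assumption about your installation (`gs` on Linux, `gswin32c.exe` on Windows), and whether the result satisfies reportlab and pyPDF would need to be tested on the actual file.

```python
import subprocess


def gs_pdf_to_pdf_cmd(src, dst, gs_exe="gs", compat="1.4"):
    """Build a Ghostscript command that rewrites src to dst with the pdfwrite device."""
    return [
        gs_exe,
        "-o", dst,                         # output file; -o also implies -dBATCH -dNOPAUSE
        "-sDEVICE=pdfwrite",               # write PDF directly, no PostScript file in between
        "-dCompatibilityLevel=" + compat,  # target PDF version (assumed: 1.4, as in step 2)
        src,
    ]


def convert_pdf(src, dst, gs_exe="gs"):
    """Run the conversion; raises CalledProcessError if Ghostscript fails."""
    subprocess.run(gs_pdf_to_pdf_cmd(src, dst, gs_exe=gs_exe), check=True)
```

Calling `convert_pdf("original.pdf", "background.pdf")` would then take the place of the pdf2ps and ps2pdf pair.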

Since you do want to get rid of internal LZW compression, here is how to do it in Ghostscript:

  1. Download a little utility program, written in PostScript language, available from the Ghostscript source code repository: pdfinflt.ps

  2. Run the following commandline:

    gswin32c.exe -- [c:/path/to/]pdfinflt.ps input.pdf output.pdf

Update: The link above points to the last version of pdfinflt.ps. The file has since been removed from the Ghostscript repository, with this commit message:

Remove pdfinflt.ps and pdfwrite.ps
-----------------------------------
pdfwrite is only (as far as I can see) used by pdfinflt.ps which says:

% It is not yet ready for prime time, but it is available for anyone wants
% to fix it.
%
% The main problem is:
%
% 1. Sometimes the PDF files that are written are broken. When they are
%    broken, GS gets an xref problem.
%
%    This problem is actually due to lib/pdfwrite.ps since even
%    when no conversion is done, the file is may be bad.

Since it doesn't work, and we can use MuPDF (which does work) for the
same task, I've chosen to delete both these files.

The resulting PDF will have all its internal data streams decompressed, without losing quality through your PDF ==> PS ==> PDF re-frying.
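
Since the commit message above names MuPDF as the working replacement, here is a hedged sketch of the same decompression step using MuPDF's `mutool clean -d`, which rewrites a PDF with all streams decompressed. The function names `mutool_inflate_cmd` and `inflate_pdf` are hypothetical, and the sketch assumes `mutool` is installed and on your PATH.

```python
import subprocess


def mutool_inflate_cmd(src, dst, mutool_exe="mutool"):
    """Build a `mutool clean -d` command: rewrite src to dst with all streams decompressed."""
    return [mutool_exe, "clean", "-d", src, dst]


def inflate_pdf(src, dst, mutool_exe="mutool"):
    """Run the command; raises CalledProcessError if mutool exits with an error."""
    subprocess.run(mutool_inflate_cmd(src, dst, mutool_exe=mutool_exe), check=True)
```

For example, `inflate_pdf("input.pdf", "output.pdf")` would play the role of the pdfinflt.ps command line shown earlier.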

Second remark:

I think you should do your 4th step with a different tool, namely pdftk. This has the advantage of saving you from going through steps 1 and 2 altogether.

pdftk (PDF ToolKit, download here) is a commandline utility, available on Linux, Unix (pdftk) and Windows (pdftk.exe), which can do a lot of things with PDFs, including overlaying the pages of two PDFs on top of each other. This is what I'd recommend you use. pdftk can overlay the PDF from your step 3 on your original PDF (or vice versa) in one go, without first needing to de-flate or de-LZW either one.

Here are commands for you to test:

    pdftk.exe ^
        original.pdf ^
        background pdf-from-userinput-step3.pdf ^
        output merged.pdf

    pdftk.exe ^
        pdf-from-userinput-step3.pdf ^
        background original.pdf ^
        output merged.pdf

    pdftk.exe ^
        original.pdf ^
        stamp pdf-from-userinput-step3.pdf ^
        output merged.pdf

    pdftk.exe ^
        pdf-from-userinput-step3.pdf ^
        stamp original.pdf ^
        output merged.pdf

You'll probably wonder about the difference between the stamp and background commands. They do what their names suggest: stamp places the page of the named PDF in the foreground, on top of the input page, while background places it behind the input page. If both PDFs have transparent backgrounds (instead of solid, opaque white), the result will in many cases look the same.
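
If you want to call pdftk from the Python code that generates the user-input PDF, a small wrapper along these lines may help. It is only a sketch: the function names `pdftk_overlay_cmd` and `overlay_pdf` are hypothetical, the file names are the ones used in the commands above, and it assumes a `pdftk` executable on your PATH (pass `pdftk_exe="pdftk.exe"` on Windows).

```python
import subprocess


def pdftk_overlay_cmd(base, overlay, output, mode="stamp", pdftk_exe="pdftk"):
    """Build a pdftk command that overlays `overlay` on `base`.

    mode="stamp" puts the overlay in the foreground,
    mode="background" puts it behind the pages of `base`.
    """
    if mode not in ("stamp", "background"):
        raise ValueError("mode must be 'stamp' or 'background', got %r" % (mode,))
    return [pdftk_exe, base, mode, overlay, "output", output]


def overlay_pdf(base, overlay, output, mode="stamp", pdftk_exe="pdftk"):
    """Run pdftk; raises CalledProcessError if it exits with an error."""
    cmd = pdftk_overlay_cmd(base, overlay, output, mode=mode, pdftk_exe=pdftk_exe)
    subprocess.run(cmd, check=True)
```

For example, `overlay_pdf("original.pdf", "pdf-from-userinput-step3.pdf", "merged.pdf", mode="stamp")` corresponds to the third command above.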