I am reading in a PDF and writing out a PDF that contains multiple copies of the original. I tested the same task with both PDFBox and iText; iText produces a much smaller output when I duplicate each page individually.

The question: Is there another way to do this in PDFBox that results in smaller output PDFs?

For one example input file, generating two copies to the output with both tools:

  • Original PDF size: 30K
  • PDFBox (v 1.7.1) generated PDF: 84K
  • iText (v 5.3.4) generated PDF: 35K

Java code for PDFBox (sorry to inflict error handling on you). Notice how it reads the input over and over and duplicates it as a whole:

PDFMergerUtility merger = new PDFMergerUtility();
PDDocument workplace = null;
try {
    for (int cnt = 0; cnt < COPIES; ++cnt) {
        PDDocument document = null;
        InputStream stream = null;
        try {
            stream = new FileInputStream(new File(sourceFileName));
            document = PDDocument.load(stream);
            if (workplace == null) {
                workplace = document;
            } else {
                merger.appendDocument(workplace, document);
            }
        } finally {
            if (document != null && document != workplace) {
                document.close();
            }
            if (stream != null) {
                stream.close();
            }
        }
    }

    OutputStream out = null;
    try {
        out = new FileOutputStream(new File(destinationFileName));
        workplace.save(out);
    } finally {
        if (out != null) {
            out.close();
        }
    }
} catch (COSVisitorException e1) {
    e1.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
} finally {
    if (workplace != null) {
        try {
            workplace.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Code to do it with iText. Notice how it loads the input file page by page and transfers each page to the output:

Document document = null;
PdfReader reader = null;
InputStream inputStream = null;
FileOutputStream outputStream = null;
try {
    inputStream = new FileInputStream(new File(sourceFileName));
    outputStream = new FileOutputStream(new File(destinationFileName));
    document = new Document();
    PdfCopy copy = new PdfSmartCopy(document, outputStream);
    document.open();
    reader = new PdfReader(inputStream);
    // loop over the pages in that document
    int pdfPageNo = reader.getNumberOfPages();
    for (int page = 0; page < pdfPageNo;) {
        PdfImportedPage onePage = copy.getImportedPage(reader, ++page);
        // duplicate each page N times
        for (int i = 0; i < COPIES; ++i) {
            copy.addPage(onePage);
        }
    }
    copy.freeReader(reader);
} catch (DocumentException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
} finally {
    if (reader != null) {
        reader.close();
    }
    if (document != null) {
        document.close();
    }
    try {
        if (inputStream != null) {
            inputStream.close();
        }
        if (outputStream != null) {
            outputStream.close();
        }
    } catch (IOException e) {
        // do nothing
    }
}

Both are surrounded by this:

public class Duplicate {

    /** The original PDF file */
    private static final String sourceFileName = "PDF_CI_US2CA.pdf";

    /** The resulting PDF file. */
    private static final String destinationFileName = "itext_output.pdf";
    private static final int COPIES = 2;

    public static void main(String[] args) {
        ...
    }
}
IMHO this is more a question about economics than a pure technical question. You have a working solution with iText, but you want to use PdfBox. This choice comes at a cost. I think you prefer PdfBox because of the ASL. However, as nobody pays for PdfBox, you shouldn't expect the library to be as fast, feature rich, complete, ... I changed the iText license from MPL/AGPL to AGPL in 2009 because I needed to start generating revenue to ensure further development of the library. Without that revenue, iText would have died a slow death. – Bruno Lowagie
@BrunoLowagie I understand what you are saying, but since I am not an expert in either library, I found one working solution. Perhaps there is another solution using PDFBox that will create smaller PDF files. Perhaps not. Perhaps iText is just better in that regard for my needs. I just want some help from experts in either tool. This brings up the question, since you are the expert on iText, of whether I have an optimal solution for creating duplicate pages in iText. – Lee Meador
PdfSmartCopy keeps a hash of certain objects, such as streams (in early versions of iText) and font dictionaries (in recent versions), in memory. Whenever an object is reused, we add a reference to the original object rather than duplicating it (as would be the case when using PdfCopy instead of PdfSmartCopy). Acrobat does an even better job: Acrobat can merge different subsets of the same font into one (larger) subset. We don't support that (yet) because it involves rewriting entire content streams (not trivial + more CPU / memory needed). – Bruno Lowagie
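
To make the hash-and-reuse idea from the comment above concrete, here is a minimal sketch in plain Java. It is not iText code: the class `StreamDeduplicator` and its methods are hypothetical names invented for illustration, and the choice of SHA-256 as the hash is an assumption. It only shows the general technique of handing back a reference to an earlier object when identical bytes are seen again.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of hash-based object reuse; not how iText is implemented. */
final class StreamDeduplicator {

    /** Maps a digest of the stream bytes to the object number first assigned to them. */
    private final Map<String, Integer> seen = new HashMap<>();

    /** Stored objects; the list index plus one serves as the object number. */
    private final List<byte[]> objects = new ArrayList<>();

    /** Returns the object number for the bytes, storing them only if they are new. */
    int add(byte[] stream) {
        String key = digest(stream);
        Integer existing = seen.get(key);
        if (existing != null) {
            return existing; // reuse: refer to the earlier object instead of copying
        }
        objects.add(stream.clone());
        int number = objects.size();
        seen.put(key, number);
        return number;
    }

    /** Number of distinct objects actually stored. */
    int storedObjectCount() {
        return objects.size();
    }

    /** Hex-encoded SHA-256 digest of the data (assumed hash; any strong digest would do). */
    private static String digest(byte[] data) {
        try {
            byte[] hash = MessageDigest.getInstance("SHA-256").digest(data);
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }

    public static void main(String[] args) {
        StreamDeduplicator dedup = new StreamDeduplicator();
        byte[] font = "same font dictionary".getBytes(StandardCharsets.UTF_8);
        int first = dedup.add(font);
        int second = dedup.add(font.clone()); // identical bytes, separate array
        System.out.println(first == second);
        System.out.println(dedup.storedObjectCount());
    }
}
```

In this sketch the second `add` call returns the same object number as the first, so only one copy is stored. A real writer would presumably also compare the bytes on a hash match to guard against collisions.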

1 Answer

Using the following solution, I was able to create a PDF file with many duplicate pages and have a minimal impact on storage.

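PDDocument samplePdf = null;
try {
    samplePdf = PDDocument.load(PDF_PATH);
    // Take the first page; this example duplicates only that page.
    PDPage page = (PDPage) samplePdf.getDocumentCatalog().getAllPages().get(0);

    for (int i = 0; i < COPIES; i++) {
        samplePdf.importPage(page);
    }

    samplePdf.save(SAVE_PATH);
} catch (IOException e) {
    e.printStackTrace();
} catch (COSVisitorException e) {
    e.printStackTrace();
} finally {
    // Release the document's resources even if loading or saving failed.
    if (samplePdf != null) {
        try {
            samplePdf.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}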

In my first attempt I used samplePdf.addPage(page), but it didn't work as expected, so there is evidently a difference between the add and import functions. I'll have to check the source or documentation to see why. Either way, this should help you devise a solution for your needs with PDFBox.