The following code demonstrates a very strange bug. Once the "source" file is closed, the "destination" file cannot be saved and closed; saving throws "java.io.IOException: COSStream has been closed and cannot be read. Perhaps its enclosing PDDocument has been closed?"

If we comment out closing the source file, the destination saves and closes properly. This suggests that the destination document ended up referencing a COSStream object that still belongs to the source document. That COSStream appears to be closed when we close the source file, and the destination can then no longer be saved.
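To make the suspected mechanism concrete, here is a toy model in plain Java. It is not PDFBox code: `ToyStream`, `ToyDocument`, and `append` are made-up names that only mimic the hypothesis above, namely that a merge which shares a stream by reference breaks once the source is closed, while a merge that deep-copies it does not.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class SharedStreamDemo {

    /** Hypothetical stand-in for a stream that becomes unreadable once closed. */
    public static final class ToyStream {
        private final String data;
        private boolean closed;

        ToyStream(String data) { this.data = data; }

        ToyStream deepCopy() { return new ToyStream(data); }

        void close() { closed = true; }

        String read() throws IOException {
            if (closed) {
                throw new IOException("stream has been closed and cannot be read");
            }
            return data;
        }
    }

    /** Hypothetical stand-in for a document that closes every stream it holds. */
    public static final class ToyDocument {
        final List<ToyStream> streams = new ArrayList<>();

        void close() {
            for (ToyStream s : streams) {
                s.close();
            }
        }

        String save() throws IOException {
            StringBuilder out = new StringBuilder();
            for (ToyStream s : streams) {
                out.append(s.read());
            }
            return out.toString();
        }
    }

    /** Appends src to dest, either sharing stream references or deep-copying them. */
    public static void append(ToyDocument dest, ToyDocument src, boolean deepCopy) {
        for (ToyStream s : src.streams) {
            dest.streams.add(deepCopy ? s.deepCopy() : s);
        }
    }

    /** Returns true if dest can still be saved after src has been closed. */
    public static boolean saveSucceedsAfterSourceClosed(boolean deepCopy) {
        ToyDocument dest = new ToyDocument();
        dest.streams.add(new ToyStream("A"));
        ToyDocument src = new ToyDocument();
        src.streams.add(new ToyStream("B"));

        append(dest, src, deepCopy);
        src.close();
        try {
            dest.save();
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("shared reference: " + saveSucceedsAfterSourceClosed(false));
        System.out.println("deep copy: " + saveSucceedsAfterSourceClosed(true));
    }
}
```

In this model the shared-reference merge fails to save and the deep-copy merge succeeds, which matches the symptom we see, but it says nothing about where in PDFBox the sharing actually happens.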

If we comment out flattening the source file's AcroForm, the destination also saves and closes properly.

This simplistic example merges one copy of the form with itself, but the bug also reproduces if you substitute certain other PDF files (all of them government forms that used to be XFA documents). Most PDFs work fine in this scenario. We down-converted the XFA documents to normal PDF to eliminate that as a variable, and the bug still persisted.

The issue exists in PDFBox version 2.0.8 and older.

    import java.io.File;
    import org.apache.pdfbox.multipdf.PDFMergerUtility;
    import org.apache.pdfbox.pdmodel.PDDocument;
    // plus the @Test import for whichever JUnit version you use

    @Test
    public void testMergeGovernmentForms() throws Exception {
        File file = new File("GeneralForbearance.pdf");
        PDDocument destination = PDDocument.load(file);

        PDDocument source = PDDocument.load(file);
        // comment out just this line and destination.save will pass
        source.getDocumentCatalog().getAcroForm().flatten();

        PDFMergerUtility appender = new PDFMergerUtility();
        appender.appendDocument(destination, source);

        // comment out just this line and destination.save will pass
        source.close();

        // throws the IOException quoted above
        destination.save(File.createTempFile("PrintMergeIssue", ".pdf"));
        destination.close();
    }

Download the GeneralForbearance.pdf from HERE

Additionally, if you "pre-flatten" the government form and save, you get the same behavior with even simpler code.

    @Test
    public void testMerge() throws Exception {
        PDFMergerUtility pdfMergerUtility = new PDFMergerUtility();
        PDDocument src = PDDocument.load(new File("C:/temp/GovFormPreFlattened.pdf"));
        PDDocument dest = PDDocument.load(new File("C:/temp/GovFormPreFlattened.pdf"));
        pdfMergerUtility.appendDocument(dest, src);
        src.close(); // if we don't close the src then we don't have an error
        dest.save(File.createTempFile("MergeIssue", ".PDF"));
        dest.close();
    }

The pre-flattened government form can be found HERE
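Since not closing the source avoids the error, one possible stopgap is to postpone closing every source document until after the destination has been saved. Below is a minimal sketch of a helper for that. `DeferredCloser` and `register` are hypothetical names, not part of PDFBox; the helper relies only on `java.io.Closeable`, which `PDDocument` implements.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper: collects resources and closes them all later, in reverse order. */
public final class DeferredCloser implements Closeable {

    private final Deque<Closeable> resources = new ArrayDeque<>();

    /** Remembers the resource for later closing and returns it unchanged. */
    public <T extends Closeable> T register(T resource) {
        resources.push(resource);
        return resource;
    }

    /** Closes every registered resource, most recently registered first. */
    @Override
    public void close() throws IOException {
        IOException first = null;
        while (!resources.isEmpty()) {
            try {
                resources.pop().close();
            } catch (IOException e) {
                if (first == null) {
                    first = e;             // report the first failure
                } else {
                    first.addSuppressed(e); // keep later failures attached to it
                }
            }
        }
        if (first != null) {
            throw first;
        }
    }
}
```

With PDFBox, each source would be opened as `closer.register(PDDocument.load(file))` inside a try-with-resources block on the `DeferredCloser`, with `dest.save(...)` called before that block ends. The obvious cost is that every source stays open until the save, which may be a problem when merging thousands of documents into one output.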

"Additionally, if you "pre-flatten" the government form and save, you get the same behavior with even simpler code." - That is really interesting. In the original version I thought something like the observed behavior was to be expected or at least possible, but that the same happens with the simpler code is really interesting. - mkl
This happens to save space... although I thought that with merging it doesn't happen, because of deep cloning. - Tilman Hausherr
We are merging thousands of individual documents into bulk PDF files to send to the mail room. Are we supposed to keep all the source documents open while we populate the output? Is it intended behavior that closing a source PDF would close a COSStream in the output file? It seems to me that an object was not cloned properly when it was merged from source to destination. I appreciate your help. - DavesPlanet
As @Tilman indicated in his comment, he thought that should not be necessary at least in pure merging use cases, i.e. in your simpler code. - mkl
My current suspicion is that it is related to not removing the fields from the structure tree when flattening. That one is optional (describes the logical structure of the document) and you can delete it with document.getDocumentCatalog().setStructureTreeRoot(null); - Tilman Hausherr

1 Answer


There were several issues in PDFMergerUtility when merging tagged documents. I fixed them and submitted the patch back to Apache as issue PDFBOX-3999; download the patch from there if you need it. The patch applies to version 2.0.8.