Splitting a large PDF file with PDFBox gets large result files

Question

I am processing some large PDF files (up to 100 MB and about 2000 pages) with PDFBox. Some of the pages contain a QR code, and I want to split those files into smaller ones, each holding the pages from one QR code to the next. My code does the split, but each result file is as large as the source file. That is, if I cut a 100 MB PDF file into ten files, I get ten files of 100 MB each.

This is the code:

    PDDocument documentoPdf =
        PDDocument.loadNonSeq(new File("myFile.pdf"),
                              new RandomAccessFile(new File("./tmp/temp"), "rw"));

    int numPages = documentoPdf.getNumberOfPages();
    List pages = documentoPdf.getDocumentCatalog().getAllPages();

    int previusQR = 0;
    for (int i = 0; i < numPages; i++) {
        PDPage page = (PDPage) pages.get(i);
        BufferedImage firstPageImage =
            page.convertToImage(BufferedImage.TYPE_USHORT_565_RGB, 200);

        String qrText = readQRWithQRCodeMultiReader(firstPageImage, hintMap);

        // A QR code on any page but the first closes the current chunk.
        if (qrText != null && i != 0) {
            PDDocument outputDocument = new PDDocument();
            for (int j = previusQR; j < i; j++) {
                outputDocument.importPage((PDPage) pages.get(j));
            }
            File f = new File("./splitting_files/" + previusQR + ".pdf");
            outputDocument.save(f);
            outputDocument.close();
            // The next chunk starts at the page holding this QR code.
            previusQR = i;
        }
    }
    // Close the source only after all chunks have been written.
    documentoPdf.close();

I also tried the following code for storing the new file:

PDDocument outputDocument = new PDDocument();

for(int j = previusQR; j<i; j++){
 PDStream src = ((PDPage)pages.get(j)).getContents();
 PDStream streamD = new PDStream(outputDocument);
 streamD.addCompression();

 PDPage newPage = new PDPage(
           new COSDictionary(((PDPage)pages.get(j)).getCOSDictionary()));
 newPage.setContents(streamD);

 byte[] buf = new byte[10240];
 int amountRead = 0;
 InputStream is = null;
 OutputStream os = null;
 is = src.createInputStream();
 os = streamD.createOutputStream();
 while((amountRead = is.read(buf,0,10240)) > -1) {
    os.write(buf, 0, amountRead);
  }

 outputDocument.addPage(newPage);
}

File f = new File("./splitting_files/"+previusQR+".pdf");

outputDocument.save(f);
outputDocument.close();

But this code creates files that lack some content and are still the same size as the original.

How can I create smaller PDF files from a larger one? Is it possible with PDFBox? Is there any other library that can render a single page to an image (for QR recognition) and also lets me split a big PDF file into smaller ones?

Thx!

What version are you using? Can you share the PDF? The effect you describe may happen if each page references all resources of all pages, instead of just the ones it is really using. — Tilman Hausherr
I am using version 1.8.9 (I am compiling with Java 1.6). You can download the file here. I generated it using PDF_Chain. — Nuria
The current version is 1.8.11 or 2.0 RC3. I tried the PDFSplit command-line utility on the first chunk; the result file (pages 1-59) is 1.7 MB. I'll try your code tonight to see if there's a difference. — Tilman Hausherr

Nuria · Accepted Answer · 2016-02-18T09:21:33

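Thanks, Tilman, you are right: the PDFSplit command generates smaller files. I checked the PDFSplit code and found that it clears the page references held by each page's annotations (the destinations of links, and the page entry of other annotations), so the output does not pull in resources it does not need.

Code extracted from Splitter.class: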

private void processAnnotations(PDPage imported) throws IOException
    {
        List<PDAnnotation> annotations = imported.getAnnotations();
        for (PDAnnotation annotation : annotations)
        {
            if (annotation instanceof PDAnnotationLink)
            {
                PDAnnotationLink link = (PDAnnotationLink)annotation;   
                PDDestination destination = link.getDestination();
                if (destination == null && link.getAction() != null)
                {
                    PDAction action = link.getAction();
                    if (action instanceof PDActionGoTo)
                    {
                        destination = ((PDActionGoTo)action).getDestination();
                    }
                }
                if (destination instanceof PDPageDestination)
                {
                    // TODO preserve links to pages within the splitted result  
                    ((PDPageDestination) destination).setPage(null);
                }
            }
            else
            {
                // TODO preserve links to pages within the splitted result  
                annotation.setPage(null);
            }
        }
    }
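To illustrate why clearing those references matters, here is a toy model in plain Java. It is not PDFBox code: ToyObject, importedPage and reachableSize are hypothetical names, and the byte sizes are made up. It assumes a writer that saves every object reachable from a page, so a single link to another source page leads through that page's parent pointer to the whole source page tree.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Toy stand-in for a PDF object; hypothetical, NOT a PDFBox class.
class ToyObject {
    final String name;
    final int sizeBytes;
    final List<ToyObject> refs = new ArrayList<ToyObject>();

    ToyObject(String name, int sizeBytes) {
        this.name = name;
        this.sizeBytes = sizeBytes;
    }
}

class ReachabilityDemo {
    // Builds a source document with pageCount pages (at least 2) and returns a
    // copy of page 0 that carries a link to source page 1. When clearDestination
    // is true, the link's page reference is dropped, analogous to setPage(null).
    static ToyObject importedPage(int pageCount, boolean clearDestination) {
        ToyObject tree = new ToyObject("page tree", 1);
        List<ToyObject> pages = new ArrayList<ToyObject>();
        List<ToyObject> images = new ArrayList<ToyObject>();
        for (int i = 0; i < pageCount; i++) {
            ToyObject image = new ToyObject("image " + i, 1000);
            ToyObject page = new ToyObject("page " + i, 10);
            page.refs.add(image);
            page.refs.add(tree); // parent pointer
            tree.refs.add(page); // kids entry
            pages.add(page);
            images.add(image);
        }
        ToyObject link = new ToyObject("link", 5);
        if (!clearDestination) {
            link.refs.add(pages.get(1));
        }
        ToyObject imported = new ToyObject("imported page 0", 10);
        imported.refs.add(images.get(0));
        imported.refs.add(link);
        return imported;
    }

    // Sums the size of every object reachable from root, counting each once.
    static int reachableSize(ToyObject root) {
        Set<ToyObject> seen =
            Collections.newSetFromMap(new IdentityHashMap<ToyObject, Boolean>());
        Deque<ToyObject> todo = new ArrayDeque<ToyObject>();
        todo.push(root);
        int total = 0;
        while (!todo.isEmpty()) {
            ToyObject current = todo.pop();
            if (!seen.add(current)) {
                continue; // already counted
            }
            total += current.sizeBytes;
            for (ToyObject ref : current.refs) {
                todo.push(ref);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("link kept:    " + reachableSize(importedPage(3, false)));
        System.out.println("link cleared: " + reachableSize(importedPage(3, true)));
    }
}
```

In this model the page with the link intact reaches all three images, while the page with the cleared link reaches only its own image. That matches the symptom in the question, where every chunk came out as large as the source.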

So in the end my code looks like this:

    PDDocument documentoPdf =
        PDDocument.loadNonSeq(new File("docs_compuestos/50.pdf"),
                              new RandomAccessFile(new File("./tmp/t"), "rw"));

    int numPages = documentoPdf.getNumberOfPages();
    List pages = documentoPdf.getDocumentCatalog().getAllPages();

    int previusQR = 0;
    for (int i = 0; i < numPages; i++) {
        PDPage firstPage = (PDPage) pages.get(i);
        // null means "no QR code on this page"; an empty string would pass
        // the null check below and split at every page.
        String qrText = null;

        BufferedImage firstPageImage =
            firstPage.convertToImage(BufferedImage.TYPE_USHORT_565_RGB, 200);
        firstPage = null;

        try {
            qrText = readQRWithQRCodeMultiReader(firstPageImage, hintMap);
        } catch (NotFoundException e) {
            // qrText stays null when the reader finds nothing
            e.printStackTrace();
        } finally {
            firstPageImage = null;
        }

        if (i != 0 && qrText != null) {
            PDDocument outputDocument = new PDDocument();
            outputDocument.setDocumentInformation(documentoPdf.getDocumentInformation());
            outputDocument.getDocumentCatalog().setViewerPreferences(
                    documentoPdf.getDocumentCatalog().getViewerPreferences());

            for (int j = previusQR; j < i; j++) {
                PDPage source = (PDPage) pages.get(j);
                PDPage importedPage = outputDocument.importPage(source);

                importedPage.setCropBox(source.findCropBox());
                importedPage.setMediaBox(source.findMediaBox());
                // only the resources of the page will be copied
                importedPage.setResources(source.getResources());
                importedPage.setRotation(source.findRotation());

                processAnnotations(importedPage);
            }

            File f = new File("./splitting_files/" + previusQR + ".pdf");
            previusQR = i;

            outputDocument.save(f);
            outputDocument.close();
        }
    }
    // Close the source only after all chunks have been written.
    documentoPdf.close();
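One thing to watch in the loop above: as shown, a chunk is only written when the next QR code is found, so the pages after the last QR code would need one more save after the loop. A minimal sketch of the boundary logic on its own, in plain Java with no PDFBox calls (SplitRanges and ranges are hypothetical names, and the QR page indices are assumed to be sorted and 0-based):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox: turns QR page positions into chunks.
class SplitRanges {
    // Returns half-open ranges {start, end} of page indices, one per output file.
    // qrPages holds the sorted 0-based indices of the pages that carry a QR code.
    static List<int[]> ranges(List<Integer> qrPages, int numPages) {
        List<int[]> result = new ArrayList<int[]>();
        int start = 0;
        for (int qrPage : qrPages) {
            // A QR code on the current start page opens no new chunk,
            // which mirrors the "i != 0" check in the loop above.
            if (qrPage > start) {
                result.add(new int[] {start, qrPage});
                start = qrPage;
            }
        }
        // Trailing chunk: everything from the last QR code to the end.
        if (numPages > start) {
            result.add(new int[] {start, numPages});
        }
        return result;
    }

    public static void main(String[] args) {
        for (int[] range : ranges(Arrays.asList(0, 3, 7), 10)) {
            System.out.println(Arrays.toString(range));
        }
    }
}
```

Each range would then map to one output document: import pages start to end - 1, call processAnnotations on each, and save.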

Thank you very much!!
