0
votes

I'm using PDFsharp to merge a lot of files (stored on disk) into one PDF. Sometimes the PDF can be as large as 700MB. I'm using the sample code provided that basically creates an output PdfDocument, adds pages to it, and then calls outputDocument.Save(destinationPath), so the amount of memory used is about the same as the size of documents produced.

Is there a way to instead stream the changes to a file to avoid the memory consumption? If not, would there be a way to do it leveraging MigraDoc?

UPDATE
Based on a suggestion in the comments, I put together some code that closes and re-opens the document. While memory use is now under control and the file does grow, it doesn't seem to be appending pages. If I make "paths" a list of 3,000 single-page files, I still get a 500-page document. Here is the code:

var destinationFile = "c:\\test.pdf";
var directory = Path.GetDirectoryName(destinationFile);

if (!Directory.Exists(directory))
{
    Directory.CreateDirectory(directory);
}

var fs = new FileStream(destinationFile, FileMode.OpenOrCreate, FileAccess.ReadWrite);
var outputDocument = new PdfDocument(fs);
var count = 0;

// Iterate files (paths is a List<string> collection)
foreach (string path in paths)
{
    var inputDocument = PdfReader.Open(path, PdfDocumentOpenMode.Import);

    // Iterate pages
    for (int idx = 0; idx < inputDocument.PageCount; idx++)
    {
        // Get the page from the external document...
        PdfPage page = inputDocument.Pages[idx];
        // ...and add it to the output document.
        outputDocument.AddPage(page);
    }

    inputDocument.Dispose();
    
    count++;
    if (count % 500 == 0 || count == paths.Count)
    {
        outputDocument.Close();
        fs.Close();
        fs.Dispose();

        if (count < paths.Count)
        {
            fs = new FileStream(destinationFile, FileMode.Append, FileAccess.Write);
            outputDocument = new PdfDocument(fs);
        }
    }
}

UPDATE 2
Here is some new code that closes and re-opens the document using PdfReader. The program merges 2,000 four-page 140 KB PDFs; the output file is 273 MB. I tried it without closing and re-opening, and then closing and re-opening every 1,000, 500, 250, and 100 files. Results were as follows:

- No interval: 21 seconds, max memory 330 MB
- 1,000 interval: 30 seconds, max memory 490 MB
- 500 interval: 55 seconds, max memory 710 MB
- 250 interval: 1 min 35 sec, max memory 780 MB
- 100 interval: 2 min 55 sec, max memory 850 MB

class Program
{
    public static void Main(string[] args)
    {
        var files = new List<string>();
        var basePath = AppDomain.CurrentDomain.BaseDirectory;

        for (var i = 0; i < 2000; i++)
        {
            files.Add($"{basePath}\\sample.pdf");
        }
        DoMerge(files, $"{basePath}\\output.pdf");
    }

    private static void DoMerge(List<string> paths, string destinationFile)
    {

        var directory = Path.GetDirectoryName(destinationFile);

        if (!Directory.Exists(directory))
        {
            Directory.CreateDirectory(directory);
        }

        var outputDocument = new PdfDocument();
        var count = 0;

        // Iterate files
        foreach (string path in paths)
        {
            // Open the document to import pages from it.
            try
            {
                var inputDocument = PdfReader.Open(path, PdfDocumentOpenMode.Import);

                // Iterate pages
                for (int idx = 0; idx < inputDocument.PageCount; idx++)
                {
                    // Get the page from the external document...
                    PdfPage page = inputDocument.Pages[idx];
                    // ...and add it to the output document.
                    outputDocument.AddPage(page);
                }

                inputDocument.Dispose();
                
                count++;
                if (count % 500 == 0 || count == paths.Count)
                {
                    outputDocument.Save(destinationFile);
                    outputDocument.Close();
                    outputDocument.Dispose();

                    if (count < paths.Count)
                    {
                        outputDocument = PdfReader.Open(destinationFile, PdfDocumentOpenMode.Import);
                    }
                }
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
                Console.WriteLine(ex.StackTrace);
            }
        }
    }
}
You have to use PdfReader.Open to open the output document to add more pages to it. – I liked the old Stack Overflow
@IlikedtheoldStackOverflow I tried that too, using both PdfDocumentOpenMode.Modify and PdfDocumentOpenMode.Import. The memory use never goes down. I honestly wouldn't expect it to; doesn't PdfReader.Open() load the entire document into memory? So if the document keeps growing, every time I open it, it'll still use more memory, no? – Rocket04
PDFsharp initializes some members when they are used. Freshly added pages require more memory than "idle" pages from an opened document. It is a proven trick to close and re-open the destination file. But while running in 32-bit mode, the limit is still 2 GiB. Maybe the interval of 500 files is too long. – I liked the old Stack Overflow

1 Answer

0
votes

To reduce the memory footprint, you can close the destination file from time to time, then open it again, and append more PDF files to it.
PDFsharp does not support swapping data to a file.

Make sure your app runs in 64-bit mode to allow it to use more than 2 GiB of RAM.
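A minimal sketch of the close-and-reopen pattern described above. This assumes PDFsharp; the batch size of 200 is an arbitrary tuning value, not a library requirement, and `Merge` is a hypothetical helper name. The reopened document uses `PdfDocumentOpenMode.Modify` so that further `AddPage` calls are permitted:

```csharp
using System.Collections.Generic;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;

static class PdfMerger
{
    // Sketch only: merges the source PDFs into destinationFile, saving,
    // closing, and re-opening the output every batchSize files so that
    // freshly added pages can be released from memory between batches.
    public static void Merge(IList<string> paths, string destinationFile, int batchSize = 200)
    {
        var output = new PdfDocument();
        var processed = 0;

        foreach (var path in paths)
        {
            // Import mode is correct for the *source* documents: pages are
            // copied out of them, and they are disposed immediately after.
            using (var input = PdfReader.Open(path, PdfDocumentOpenMode.Import))
            {
                for (var i = 0; i < input.PageCount; i++)
                    output.AddPage(input.Pages[i]);
            }

            processed++;
            if (processed % batchSize == 0 || processed == paths.Count)
            {
                output.Save(destinationFile); // flush everything accumulated so far
                output.Close();

                if (processed < paths.Count)
                {
                    // Re-open the *destination* in Modify mode so new pages
                    // can still be appended to it.
                    output = PdfReader.Open(destinationFile, PdfDocumentOpenMode.Modify);
                }
            }
        }
    }
}
```

Note that re-opening still reads the whole destination file back into memory, so the savings come only from releasing the per-page state of freshly added pages; combined with a 64-bit process, that is usually enough headroom.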