I'm using PDFsharp to merge a large number of files (stored on disk) into one PDF; sometimes the result can be as large as 700MB. I'm using the provided sample code, which basically creates an output PdfDocument, adds pages to it, and then calls outputDocument.Save(destinationPath), so the amount of memory used is roughly the size of the document being produced.
Is there a way to stream the changes to a file instead, to avoid the memory consumption? If not, would there be a way to do it with MigraDoc?
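For reference, the merge loop is essentially the PDFsharp "concatenate documents" sample (a minimal sketch; the method name and paths are illustrative):

using System.Collections.Generic;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;

// Sketch of the sample-style merge: every imported page is kept in
// memory until the single Save call at the end.
static void Merge(List<string> paths, string destinationPath)
{
    var outputDocument = new PdfDocument();
    foreach (var path in paths)
    {
        using (var inputDocument = PdfReader.Open(path, PdfDocumentOpenMode.Import))
        {
            for (int idx = 0; idx < inputDocument.PageCount; idx++)
            {
                outputDocument.AddPage(inputDocument.Pages[idx]);
            }
        }
    }
    outputDocument.Save(destinationPath);
}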
UPDATE
Based on a suggestion in the comments, I put together some code that closes and re-opens the document. While memory use is under control and the file does grow, it doesn't seem to be appending pages: if I make "paths" a list of 3000 single-page files, I still get a 500-page document. Here is the code:
var destinationFile = "c:\\test.pdf";
var directory = Path.GetDirectoryName(destinationFile);
if (!Directory.Exists(directory))
{
    Directory.CreateDirectory(directory);
}
var fs = new FileStream(destinationFile, FileMode.OpenOrCreate, FileAccess.ReadWrite);
var outputDocument = new PdfDocument(fs);
var count = 0;
// Iterate files (paths is a List<string> collection)
foreach (string path in paths)
{
    var inputDocument = PdfReader.Open(path, PdfDocumentOpenMode.Import);
    // Iterate pages
    for (int idx = 0; idx < inputDocument.PageCount; idx++)
    {
        // Get the page from the external document...
        PdfPage page = inputDocument.Pages[idx];
        // ...and add it to the output document.
        outputDocument.AddPage(page);
    }
    inputDocument.Dispose();
    count++;
    // Every 500 files (and at the end), close the document and the
    // stream, then re-open the file to continue appending.
    if (count % 500 == 0 || count == paths.Count)
    {
        outputDocument.Close();
        fs.Close();
        fs.Dispose();
        if (count < paths.Count)
        {
            fs = new FileStream(destinationFile, FileMode.Append, FileAccess.Write);
            outputDocument = new PdfDocument(fs);
        }
    }
}
UPDATE 2
Here is some new code that closes and re-opens the document using PdfReader.Open. The program merges 2000 four-page, 140KB PDFs; the output file is 273MB. I tried it without closing and re-opening, and then closing and re-opening every 1000, 500, 250, and 100 files. Results were as follows (see the measurement sketch after the code below):
No interval: 21 seconds, max memory 330MB
1000 interval: 30 seconds, max memory 490MB
500 interval: 55 seconds, max memory 710MB
250 interval: 1 min 35 sec, max memory 780MB
100 interval: 2 min 55 sec, max memory 850MB
class Program
{
    public static void Main(string[] args)
    {
        var files = new List<string>();
        var basePath = AppDomain.CurrentDomain.BaseDirectory;
        for (var i = 0; i < 2000; i++)
        {
            files.Add($"{basePath}\\sample.pdf");
        }
        DoMerge(files, $"{basePath}\\output.pdf");
    }

    private static void DoMerge(List<string> paths, string destinationFile)
    {
        var directory = Path.GetDirectoryName(destinationFile);
        if (!Directory.Exists(directory))
        {
            Directory.CreateDirectory(directory);
        }
        var outputDocument = new PdfDocument();
        var count = 0;
        // Iterate files
        foreach (string path in paths)
        {
            // Open the document to import pages from it.
            try
            {
                var inputDocument = PdfReader.Open(path, PdfDocumentOpenMode.Import);
                // Iterate pages
                for (int idx = 0; idx < inputDocument.PageCount; idx++)
                {
                    // Get the page from the external document...
                    PdfPage page = inputDocument.Pages[idx];
                    // ...and add it to the output document.
                    outputDocument.AddPage(page);
                }
                inputDocument.Dispose();
                count++;
                // At each interval (and at the end), flush the merged pages
                // to disk, close the document, and re-open it to continue.
                if (count % 500 == 0 || count == paths.Count)
                {
                    outputDocument.Save(destinationFile);
                    outputDocument.Close();
                    outputDocument.Dispose();
                    if (count < paths.Count)
                    {
                        outputDocument = PdfReader.Open(destinationFile, PdfDocumentOpenMode.Import);
                    }
                }
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
                Console.WriteLine(ex.StackTrace);
            }
        }
    }
}
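A minimal sketch of how such timing/peak-memory numbers can be captured in-process, placed inside Main above (a sketch only; Stopwatch and PeakWorkingSet64 from System.Diagnostics are my assumptions about the measurement, not necessarily how the numbers above were taken):

// Assumes: using System.Diagnostics;
// Record elapsed time and the process's peak working set around the merge.
var sw = Stopwatch.StartNew();
DoMerge(files, $"{basePath}\\output.pdf");
sw.Stop();
using (var proc = Process.GetCurrentProcess())
{
    Console.WriteLine($"Elapsed: {sw.Elapsed}");
    Console.WriteLine($"Peak working set: {proc.PeakWorkingSet64 / (1024 * 1024)}MB");
}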