split huge 40000 page pdf into single pages, itextsharp, outofmemoryexception

Question

I am getting huge PDF files with lots of data. The current PDF is 350 MB and has about 40000 pages. It would of course have been nice to get smaller PDFs, but this is what I have to work with now :-(

I can open it in Acrobat Reader with some delay while loading, but after that Acrobat Reader is quick.

Now I need to split the huge file into single pages, read some recipient data from each page, and then send each recipient the one or two pages meant for them.

Here is my very small code so far, using iTextSharp:

var inFileName = @"huge350MB40000pages.pdf";
PdfReader reader = new PdfReader(inFileName); // stalls here, then throws OutOfMemoryException
var nbrPages = reader.NumberOfPages;
reader.Close();

Execution reaches the second line, "new PdfReader", and stays there for perhaps 10 minutes. The process grows to about 1.7 GB, and then I get an OutOfMemoryException.

I think "new PdfReader" attempts to read the entire PDF into memory.

Is there some other/better way to do this? For example, can I somehow read only part of a PDF file into memory instead of all of it at once? Could it work better with a library other than iTextSharp?

Wolfram Alpha says that a 40,000 page document printed on both sides would be 80 inches tall - over 2m. — Cheeso
stackoverflow.com/questions/656351/… could be helpful to try another library or two to see if some have better read properties. — Tim Snowhite

Tim Friesen · Accepted Answer · 2011-08-09T16:47:29

From what I have read, when instantiating the PdfReader you should use the constructor that takes a RandomAccessFileOrArray object. Disclaimer: I have not tried this out myself.

iTextSharp.text.pdf.PdfReader reader = new iTextSharp.text.pdf.PdfReader(new iTextSharp.text.pdf.RandomAccessFileOrArray(@"C:\PDFFile.pdf"), null);
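
To connect this to the splitting step in the question, here is a minimal sketch of how the partial-mode reader might be combined with PdfCopy to write one file per page. It assumes iTextSharp 5.x, where to my knowledge this constructor reads only the cross-reference data up front and loads other objects as needed (newer 5.x releases may mark RandomAccessFileOrArray(string) as obsolete). The output directory and file naming are hypothetical, and this has not been run against a 40000-page file.

```csharp
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

class SplitPages
{
    static void Main()
    {
        // Input name taken from the question; the output directory is a hypothetical choice.
        string inFileName = @"huge350MB40000pages.pdf";
        string outDir = @"pages";
        Directory.CreateDirectory(outDir);

        // Partial mode: objects are loaded on demand instead of parsing the whole file at once.
        PdfReader reader = new PdfReader(new RandomAccessFileOrArray(inFileName), null);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // Hypothetical naming scheme: page00001.pdf, page00002.pdf, ...
                string outFileName = Path.Combine(outDir, "page" + page.ToString("D5") + ".pdf");
                using (FileStream fs = new FileStream(outFileName, FileMode.Create))
                {
                    Document doc = new Document(reader.GetPageSizeWithRotation(page));
                    PdfCopy copy = new PdfCopy(doc, fs);
                    doc.Open();
                    copy.AddPage(copy.GetImportedPage(reader, page));
                    doc.Close(); // also closes the writer and the underlying stream
                }

                // Let the reader drop its cached objects for this page (only has an effect in partial mode).
                reader.ReleasePage(page);
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

Whether this stays within memory limits for a 350 MB file depends on the PDF's internal structure, so it is worth watching the process size on a first run.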
