14
votes

I am getting huge PDF files with lots of data. The current PDF is 350 MB and has about 40000 pages. It would of course have been nice to get smaller PDFs, but this is what I have to work with now :-(

I can open it in Acrobat Reader. Loading takes a while, but after that Acrobat Reader is quick.

Now I need to split the huge file into single pages, read some recipient data from each page, and then send each recipient the one or two pages meant for them.

Here is my very small code so far, using iTextSharp:

var inFileName = @"huge350MB40000pages.pdf";
PdfReader reader = new PdfReader(inFileName);
var nbrPages = reader.NumberOfPages;
reader.Close();

Execution reaches the second line ("new PdfReader") and stays there for perhaps 10 minutes while the process grows to about 1.7 GB. Then I get an OutOfMemoryException.

I think the "new PdfReader" attempts to read the entire PDF into memory.

Is there some other/better way to do this? For example, can I somehow read only a part of a PDF file into memory instead of all of it at once? Could it work better using some library other than iTextSharp?

5
Wolfram Alpha says that a 40,000 page document printed on both sides would be 80 inches tall - over 2m. - Cheeso
Just out of curiosity, what is this PDF? - user703016
stackoverflow.com/questions/656351/… could be helpful to try another library or two to see if some have better read properties. - Tim Snowhite
@Cicada: It's probably the US Tax Code!! :P - Chris Dunaway
It's a set of invoices for a small public utility company. - tomsv

5 Answers

17
votes

From what I have read, it looks like you should instantiate the PdfReader with the constructor that takes a RandomAccessFileOrArray object. Disclaimer: I have not tried this out myself.

iTextSharp.text.pdf.PdfReader reader = new iTextSharp.text.pdf.PdfReader(new iTextSharp.text.pdf.RandomAccessFileOrArray(@"C:\PDFFile.pdf"), null);

4
votes

This is a total shot in the dark, and I haven't tested this code. It's an extract from the book 'iText in Action', given there as an example of how to deal with large PDF files. The code is in Java but should be fairly easy to convert.

This is the method that loads everything into memory:

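PdfReader reader;
long before;

// Full read: the whole document is parsed into memory.
// getMemoryUse() is a helper defined elsewhere in the book's example.
before = getMemoryUse();
reader = new PdfReader("HelloWorldToRead.pdf", null);
System.out.println("Memory used by the full read: " + (getMemoryUse() - before));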
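
The book's getMemoryUse() helper is not reproduced in this answer. Here is a minimal sketch of what such a helper could look like, assuming it only needs to report the approximate heap in use. The class name MemoryUse and the method body are my own guess, not the book's code, and the figure is only approximate because gc() is just a hint to the JVM.

```java
class MemoryUse {
    /** Approximate heap currently in use, in bytes (hypothetical stand-in for the book's helper). */
    static long getMemoryUse() {
        Runtime rt = Runtime.getRuntime();
        rt.gc(); // only a hint; the JVM may ignore it
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        long before = getMemoryUse();
        byte[] block = new byte[16 * 1024 * 1024]; // something measurable to allocate
        long after = getMemoryUse();
        System.out.println("Approximate bytes used by the allocation: " + (after - before));
        System.out.println("Allocated array length: " + block.length); // keeps 'block' reachable
    }
}
```

The difference between two calls gives a rough idea of what an operation costs, which is all the comparison here needs.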

This is the memory-saving way, where the document should be loaded bit by bit as required:

// Partial read: pass a RandomAccessFileOrArray so that objects are loaded as required.
before = getMemoryUse();
reader = new PdfReader(new RandomAccessFileOrArray("HelloWorldToRead.pdf"), null);
System.out.println("Memory used by the partial read: " + (getMemoryUse() - before));
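
To show why reading only part of a PDF is possible at all: a PDF ends with a "startxref" keyword followed by the byte offset of its cross-reference data, which in turn says where each object sits in the file. The sketch below uses only the standard library and reads just the last 1024 bytes of the file to find that offset. It is not iText's code, and the names PdfTail and findStartXref are made up for this illustration. I am also assuming the keyword falls within the last 1024 bytes, which should hold for ordinary files.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

class PdfTail {
    /** Returns the offset that follows the last "startxref" keyword, reading only the file's tail. */
    static long findStartXref(String path) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            int tailSize = (int) Math.min(1024, file.length());
            byte[] tail = new byte[tailSize];
            file.seek(file.length() - tailSize); // jump to the tail without reading the rest
            file.readFully(tail);
            String text = new String(tail, StandardCharsets.ISO_8859_1);

            int keyword = text.lastIndexOf("startxref");
            if (keyword < 0) {
                throw new IOException("startxref not found in the last " + tailSize + " bytes");
            }
            // Skip the whitespace after the keyword, then collect the digits of the offset.
            int pos = keyword + "startxref".length();
            while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) pos++;
            int end = pos;
            while (end < text.length() && Character.isDigit(text.charAt(end))) end++;
            if (end == pos) {
                throw new IOException("no offset found after startxref");
            }
            return Long.parseLong(text.substring(pos, end));
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java PdfTail <file.pdf>");
            return;
        }
        System.out.println("Cross-reference data starts at byte " + findStartXref(args[0]));
    }
}
```

Memory use here does not depend on the size of the file, which is presumably the property the partial-read constructor relies on.
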
0
votes

You might be able to use Ghostscript directly. http://svn.ghostscript.com/ghostscript/tags/ghostscript-9.02/doc/Use.htm#One_page_per_file

For reading the recipient data, PDFTextStream might be a good choice.

0
votes

PDF Toolkit is quite useful for these types of tasks. I haven't tried it with such a huge file yet, though.

0
votes

Could it work better using some other library than itextsharp?

Please try Aspose.Pdf for .NET. It lets you split a PDF into single pages or into different sets of pages in various ways, using either files or memory streams. The API is very simple to learn and use, and it works with large PDF files that have a large number of pages.

Disclosure: I work as a developer evangelist at Aspose.