I have several PDF documents that supposedly contain scanned images, but upon inspection in Acrobat Pro, each page contains a huge number of tiny "inline images". From what I understand, these are not regular images stored as XObjects, but images embedded directly in the page content streams.

How could I go about extracting and merging these images?

The only code I could find online starts out like this:

var reader = new PdfReader(@"path\to\file.pdf");
PdfDocument document = new PdfDocument(reader);

for (var i = 1; i <= document.GetNumberOfPages(); i++)
{
    PdfDictionary obj = (PdfDictionary)document.GetPdfObject(i);
    // ... more code goes here
}

...but the rest of the code doesn't work because the PdfDictionary returned from GetPdfObject is not a stream, only a dictionary. I don't know how to access the images inside it.
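For context, `GetPdfObject(i)` looks up indirect object number `i`, not page `i`, which is why an arbitrary dictionary comes back. Below is a minimal sketch of the page-based route, assuming iText 7 for .NET and its content-stream parser (`PdfCanvasProcessor` with an `IEventListener`). The class name `ImageCollector` is hypothetical, and the sketch only counts and locates the images; it does not merge them.

```csharp
using System;
using System.Collections.Generic;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Data;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

// Hypothetical listener: receives one event per image painted by the content stream.
class ImageCollector : IEventListener
{
    public int Count { get; private set; }

    public void EventOccurred(IEventData data, EventType type)
    {
        if (data is ImageRenderInfo info)
        {
            Count++;
            // IsInline() tells inline images apart from XObject images;
            // GetImageCtm() is the matrix that places the image on the page.
            Console.WriteLine($"inline={info.IsInline()} ctm={info.GetImageCtm()}");
        }
    }

    public ICollection<EventType> GetSupportedEvents()
    {
        // Only ask the processor for image-rendering events.
        return new HashSet<EventType> { EventType.RENDER_IMAGE };
    }
}

class Program
{
    static void Main()
    {
        var document = new PdfDocument(new PdfReader(@"path\to\file.pdf"));

        for (var i = 1; i <= document.GetNumberOfPages(); i++)
        {
            var collector = new ImageCollector();
            // Parse the page's content stream and dispatch events to the listener.
            new PdfCanvasProcessor(collector).ProcessPageContent(document.GetPage(i));
            Console.WriteLine($"page {i}: {collector.Count} images");
        }

        document.Close();
    }
}
```

The placement matrix reported for each image is what any merging step would need in order to position the tiles relative to one another.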

Perhaps I need to clarify that when I wrote "several PDF documents" I meant several thousand, with more on the way. I'm using a PDF printer to help my clients convert many different formats to PDF, and the resulting documents are almost impossible to handle because each one contains 10,000+ images, each only 1-5 pixels in size. I MUST find a way to automatically restore these documents to sanity. - user884248
Extracting and merging those images may be pretty complex, depending on the shape of the images, potential transparency/masks, etc. The best way would probably be to render each PDF page to a high-quality raster image and then use that rendered representation. - Alexey Subach