I am trying to extract images using the PDFsharp library. As mentioned in the sample program, the library does not support extracting non-JPEG images, so I am trying to do it myself.

I found a non-working sample program for the same purpose. I am using the following code to extract a 400 x 400 PNG image embedded in a PDF file (the image was first inserted into an MS Word file, which was then saved as a PDF).

PDF File Link:

https://drive.google.com/open?id=1aB-SrMB3eu00BywliOBC8AW0JqRa0Hbd

EXTRACTION CODE:

 static void ExportAsPngImage(PdfDictionary image, ref int count)
    {
        int width = image.Elements.GetInteger(PdfSharp.Pdf.Advanced.PdfImage.Keys.Width);
        int height = image.Elements.GetInteger(PdfSharp.Pdf.Advanced.PdfImage.Keys.Height);            
        System.Drawing.Imaging.PixelFormat pixelFormat = System.Drawing.Imaging.PixelFormat.Format8bppIndexed;           

        byte[] original_byte_boundary = image.Stream.UnfilteredValue;
        byte[] result_byte_boundary = null;           

        // Image data in BMP files always starts at a DWORD boundary; in PDF it starts at a BYTE boundary.
        // You must copy the image data line by line and start each line at the DWORD boundary.

        byte[, ,] copy_dword_boundary = new byte[3, height, width];

        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
            {
                if (x <= width && (x + (y * width) != original_byte_boundary.Length))
                // while not at end of line, take original array
                {
                    copy_dword_boundary[0, y, x] = original_byte_boundary[3*x + (y * width)];
                    copy_dword_boundary[1, y, x] = original_byte_boundary[3*x + (y * width) + 1];
                    copy_dword_boundary[2, y, x] = original_byte_boundary[3*x + (y * width) + 2];
                }
                else //fill new array with ending 0
                {
                    copy_dword_boundary[0, y, x] = 0;
                    copy_dword_boundary[1, y, x] = 0;
                    copy_dword_boundary[2, y, x] = 0;
                }
            }
        }
        result_byte_boundary = new byte[3 * width * height];
        int counter = 0;
        int n_width = copy_dword_boundary.GetLength(2);
        int n_height = copy_dword_boundary.GetLength(1);

        for (int x = 0; x < width; x++)
        {
            for (int y = 0; y < height; y++)
            {   //put 3dim array back in 1dim array
                result_byte_boundary[counter] = copy_dword_boundary[0, x, y];
                result_byte_boundary[counter + 1] = copy_dword_boundary[1, x, y];
                result_byte_boundary[counter + 2] = copy_dword_boundary[2, x, y];

                //counter++;
                counter = counter + 3;
            }
        }


        Bitmap bmp = new Bitmap(width, height, pixelFormat);            
        System.Drawing.Imaging.BitmapData bmd = bmp.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), ImageLockMode.WriteOnly, bmp.PixelFormat);
        System.Runtime.InteropServices.Marshal.Copy(result_byte_boundary, 0, bmd.Scan0, result_byte_boundary.Length);
        bmp.UnlockBits(bmd);
        using (FileStream fs = new FileStream(@"D:\TestPdf\" + String.Format("Image{0}.png", count), FileMode.Create, FileAccess.Write))
        {
            bmp.Save(fs, ImageFormat.Png);
            count++;
        }
    }

PROBLEM:

Whichever PixelFormat I choose, the saved PNG image does not look correct.

[Image: original PNG image (bit depth 32)]

[Image: result with PixelFormat = Format24bppRgb]

There are a number of ways in which the stream bits may be formatted, so a generic solution might well be beyond the scope of a Stack Overflow answer. How about looking into the code of an open source PDF library with an appropriate license that has already implemented an image export function, for inspiration? - mkl
@mkl: Could you suggest an open source library that can reliably extract images from a PDF? The library by Bit Miracle worked reliably for me, but it's not open source. - skm
Your question is not about iText, I removed the tag. - Amedee Van Gasse
I don't do large-scale image extraction, so I cannot talk about reliability. Furthermore, reliability might depend on the types of images you encounter: PDF allows many variations in images. Also be aware that, depending on how exactly you let yourself be inspired, there might be licensing consequences: if you simply copy non-trivial code, you may well become subject to the license of the library you copy from. - mkl
@mkl: I understand that I cannot simply copy-paste the non-trivial code :) I just need some inspiration code to extract the images. I do not need other functionalities offered by the paid libraries. - skm

1 Answer

You can get the pixel format from the PDF file. Since you did not include the PDF in your post, I cannot tell you which format would be correct.
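
As a rough sketch of where that information lives: the image dictionary carries `/BitsPerComponent` and `/ColorSpace` entries next to `/Width` and `/Height`. The names `PdfImageFormat`, `Guess`, and `Map` below are made up for illustration. The mapping covers only the simplest cases and assumes the color space is stored as a direct name; ICC-based, indexed, or indirectly referenced color spaces would need additional handling.

```csharp
using System;
using System.Drawing.Imaging;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Advanced;

static class PdfImageFormat
{
    // Hypothetical helper: reads the "header" entries from the image dictionary.
    public static PixelFormat Guess(PdfDictionary image)
    {
        int bitsPerComponent = image.Elements.GetInteger(PdfImage.Keys.BitsPerComponent);

        // Assumption: /ColorSpace is a direct name such as /DeviceRGB.
        // An array (ICCBased, Indexed, ...) or a reference ends up as null here.
        PdfName colorSpace = image.Elements[PdfImage.Keys.ColorSpace] as PdfName;

        return Map(bitsPerComponent, colorSpace == null ? null : colorSpace.Value);
    }

    // Pure mapping, kept separate so it can be checked without a PDF file.
    public static PixelFormat Map(int bitsPerComponent, string colorSpaceName)
    {
        if (bitsPerComponent == 8 && colorSpaceName == "/DeviceRGB")
            return PixelFormat.Format24bppRgb;     // 3 bytes per pixel
        if (bitsPerComponent == 8 && colorSpaceName == "/DeviceGray")
            return PixelFormat.Format8bppIndexed;  // needs a grayscale palette
        if (bitsPerComponent == 1 && colorSpaceName == "/DeviceGray")
            return PixelFormat.Format1bppIndexed;  // needs a two-entry palette

        throw new NotSupportedException(
            "Unhandled combination: " + bitsPerComponent + " bits, color space " + colorSpaceName);
    }
}
```

If `Guess` throws for your file, inspecting the `/ColorSpace` entry in the dictionary should show which variant the PDF producer actually wrote.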

PDF files do not contain PNG images; instead, images use a special PDF image format that is somewhat similar to the BMP files used by Windows, but without any headers in the binary data. The "header" information is stored in the properties of the image object. See the PDF Reference for further details.
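
To illustrate the difference from BMP, here is a minimal sketch of the row-by-row copy for the simplest case: 8 bits per component, DeviceRGB, no predictor or other encoding left in the unfiltered stream. The names `RowCopy`, `AlignedStride`, and `RgbToAlignedBgr` are hypothetical. The sketch relies on two facts: PDF stores each row tightly packed as R, G, B, while a GDI+ `Format24bppRgb` bitmap stores B, G, R and pads each row to a multiple of 4 bytes.

```csharp
using System;

static class RowCopy
{
    // Length of one bitmap row, rounded up to a multiple of 4 bytes (DWORD).
    public static int AlignedStride(int width, int bytesPerPixel)
    {
        return (width * bytesPerPixel + 3) / 4 * 4;
    }

    // Copies tightly packed 8-bit RGB rows into DWORD-aligned BGR rows.
    public static byte[] RgbToAlignedBgr(byte[] rgb, int width, int height)
    {
        int srcStride = width * 3;                 // PDF rows have no padding
        int dstStride = AlignedStride(width, 3);   // bitmap rows are padded
        if (rgb.Length < srcStride * height)
            throw new ArgumentException("Stream is shorter than width * height * 3 bytes.");

        byte[] bgr = new byte[dstStride * height]; // padding bytes stay 0
        for (int y = 0; y < height; y++)
        {
            int src = y * srcStride;
            int dst = y * dstStride;
            for (int x = 0; x < width; x++, src += 3, dst += 3)
            {
                bgr[dst] = rgb[src + 2];           // blue
                bgr[dst + 1] = rgb[src + 1];       // green
                bgr[dst + 2] = rgb[src];           // red
            }
        }
        return bgr;
    }
}
```

When writing into a locked `Bitmap`, the `Stride` reported by `BitmapData` should be treated as the authoritative destination row length. For a 400-pixel-wide 24-bit image, a row is 1200 bytes, which is already a multiple of 4, so padding alone would not explain a distorted result at that size.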