0
votes

Below is the approach I have used to extract images from a PDF, but the Subtype always comes back as null. I am working with the iText 7 library, which is the new version. If anybody has worked with the new library, please give suggestions.

    public static string ExtractImageFromPDF(string sourcePdf)
    {            
        PdfReader reader = new PdfReader(sourcePdf);
        try
        {
            PdfDocument document = new PdfDocument(reader);

            for (int pageNumber = 1; pageNumber <= document.GetNumberOfPages(); pageNumber++)
            {
                PdfDictionary obj = (PdfDictionary)document.GetPdfObject(pageNumber);

                if (obj != null && obj.IsStream())
                {
                    PdfDictionary pd = (PdfDictionary)obj;
                    if (pd.ContainsKey(PdfName.Subtype) && pd.Get(PdfName.Subtype).ToString() == "/Image")
                    {
                        string filter = pd.Get(PdfName.Filter).ToString();
                        string width = pd.Get(PdfName.Width).ToString();
                        string height = pd.Get(PdfName.Height).ToString();
                        string bpp = pd.Get(PdfName.BitsPerComponent).ToString();
                        string extent = ".";
                        byte[] img = null;
                        switch (filter)
                        {
                            case "/FlateDecode":
                                byte[] arr = FlateDecodeFilter.FlateDecode(null, true);
                                Bitmap bmp = new Bitmap(Int32.Parse(width), Int32.Parse(height), PixelFormat.Format24bppRgb);
                                BitmapData bmd = bmp.LockBits(new Rectangle(0, 0, Int32.Parse(width), Int32.Parse(height)), ImageLockMode.WriteOnly,
                                    PixelFormat.Format24bppRgb);
                                Marshal.Copy(arr, 0, bmd.Scan0, arr.Length);
                                bmp.UnlockBits(bmd);
                                bmp.Save("d:\\pdf\\bmp1.png", ImageFormat.Png);
                                break;
                            case "/CCITTFaxDecode":
                                break;
                            default:
                                break;
                        }
                    }
                }
            }
        }
        catch
        {
            throw;
        }
        return "";
    }
2
"it is returning null" nothing in the code you've posted returns null. - Ian Kemp
Sorry, the Subtype is always coming back as a null value. - sainath sagar
You mean pd.Get(PdfName.Subtype).ToString() == null? - Ian Kemp
Yes, it is giving null. - sainath sagar
The correct call is document.GetPdfObject(objectNumber), not document.GetPdfObject(pageNumber). - Tomex Ou

2 Answers

0
votes

When you use QuickWatch on the pd value, what do you see in there? The iText 7 documentation states that it is a dictionary, so perhaps you can check which entries are available and find the field that you're looking for.

PdfDictionary pd = (PdfDictionary)obj;

Documentation can be found here: https://api.itextpdf.com/iText7/dotnet/7.1.8/classi_text_1_1_kernel_1_1_pdf_1_1_pdf_dictionary.html
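
If a debugger is not at hand, the same inspection can be done in code. The following is a minimal sketch, assuming the KeySet() and Get() members of PdfDictionary behave as described in the linked documentation; the class and method names (PdfDictionaryDump, PrintEntries) are hypothetical and only serve the illustration.

```csharp
using System;
using iText.Kernel.Pdf;

public static class PdfDictionaryDump
{
    // Hypothetical helper: prints every key of the dictionary with its value,
    // so you can see whether a /Subtype entry is present at all.
    public static void PrintEntries(PdfDictionary pd)
    {
        foreach (PdfName key in pd.KeySet())
        {
            Console.WriteLine($"{key} = {pd.Get(key)}");
        }
    }
}
```

Calling PrintEntries(pd) right after the cast shows what kind of object you are actually holding. A page or catalog dictionary, for example, would normally have a /Type entry but no /Subtype.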

0
votes

The idea of your approach is to check, for every indirect object in the document, whether it is an image XObject and, if it is, to extract the image data it contains.

Actually, though, you only iterate over the values 1..document.GetNumberOfPages() as object numbers, i.e. only over a fraction of the indirect objects of your document!

Indeed, there are more indirect objects in a PDF than there are pages, usually very many more.

Thus, iterate instead up to document.GetNumberOfPdfObjects()-1.
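
The corrected loop could look roughly like the sketch below. It is not a definitive implementation: the class name, the method name, the outputDir parameter, and the output file naming are made up for illustration. Using PdfImageXObject to obtain the image bytes and the file extension is my own substitution for the manual filter handling in the question, and it may throw for images whose filter or color space iText cannot decode.

```csharp
using System.IO;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Xobject;

public static class PdfImageExtractor
{
    // Hypothetical helper: writes every image XObject found in sourcePdf
    // to outputDir and returns the number of images written.
    public static int ExtractImages(string sourcePdf, string outputDir)
    {
        int count = 0;
        PdfDocument document = new PdfDocument(new PdfReader(sourcePdf));
        try
        {
            // Iterate over object numbers, not page numbers.
            int objectCount = document.GetNumberOfPdfObjects();
            for (int objNum = 1; objNum < objectCount; objNum++)
            {
                PdfObject obj = document.GetPdfObject(objNum);

                // Skip free entries and anything that is not a stream;
                // casting blindly to PdfDictionary would fail for arrays, numbers, etc.
                if (obj == null || !obj.IsStream())
                    continue;

                PdfStream stream = (PdfStream)obj;
                if (!PdfName.Image.Equals(stream.GetAsName(PdfName.Subtype)))
                    continue;

                // Assumption: iText can decode this image; unsupported
                // filters or color spaces may raise an exception here.
                PdfImageXObject image = new PdfImageXObject(stream);
                string extension = image.IdentifyImageFileExtension();
                string path = Path.Combine(outputDir, $"image-{objNum}.{extension}");
                File.WriteAllBytes(path, image.GetImageBytes());
                count++;
            }
        }
        finally
        {
            document.Close();
        }
        return count;
    }
}
```

A call such as PdfImageExtractor.ExtractImages("input.pdf", "d:\\pdf") would then visit every indirect object once. Comparing against PdfName.Image avoids relying on the string form of the name.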