Extracting text from PDF

Question

I am attempting to extract text from a PDF file using the code found here. The code employs the zlib library.

AFAICT, the program works by finding the blocks of data between occurrences of the text "stream" and "endstream" in the PDF file. These chunks are then inflated by zlib.

The code works perfectly on one sample PDF document, but on another, zlib's inflate() function returns -3 (Z_DATA_ERROR) every time it is called.

I noticed that the PDF file that fails is set up so that, when it is opened in Adobe Reader, there is no "copy" option. Could this be related to the inflate() error? And if it is, is there a way around the problem?

Code snippet below - see comments

            // Now use zlib to inflate:
            z_stream zstrm;
            ZeroMemory(&zstrm, sizeof(zstrm));

            zstrm.avail_in = streamend - streamstart + 1;
            zstrm.avail_out = outsize;
            zstrm.next_in = (Bytef*)(buffer + streamstart);
            zstrm.next_out = (Bytef*)output;

            int rsti = inflateInit(&zstrm);
            if (rsti == Z_OK)
            {
                int rst2 = inflate(&zstrm, Z_FINISH); // HERE IT RETURNS -3 (Z_DATA_ERROR)
                if (rst2 >= 0)
                {
                    // Ok, got something, extract the text:
                    size_t totout = zstrm.total_out;
                    ProcessOutput(fileo, output, totout);
                }
                inflateEnd(&zstrm); // release zlib's internal state
            }
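One way to narrow down a Z_DATA_ERROR like this is to check whether the bytes handed to inflate() look like a zlib stream at all. The sketch below is not part of the original code; `LooksLikeZlibHeader` is a hypothetical helper that applies the two-byte header rules from RFC 1950 (compression method 8, window size field at most 7, and the header value divisible by 31). It assumes the pointer passed in is the first data byte after the "stream" keyword and its line ending.

```cpp
#include <cstddef>

// Hypothetical helper (not from the original code): returns true if the
// first two bytes satisfy the zlib header rules in RFC 1950.
bool LooksLikeZlibHeader(const unsigned char* data, std::size_t size)
{
    if (data == nullptr || size < 2)
        return false;

    const unsigned cmf = data[0];   // compression method and flags
    const unsigned flg = data[1];   // flags, including the check bits

    if ((cmf & 0x0F) != 8)          // CM must be 8 (deflate)
        return false;
    if ((cmf >> 4) > 7)             // CINFO above 7 is not allowed
        return false;

    // CMF * 256 + FLG must be a multiple of 31.
    return ((cmf << 8) | flg) % 31 == 0;
}
```

If a check like this fails for every stream in a file, the data is probably not plain Flate data (for example, it may be encrypted or use a different filter). After a failed inflate() call, `zstrm.msg` may also hold a short description of the error.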

EDIT: I tested text extraction from the "encrypted" PDF via an online PDF-to-text converter called Zamzar, and the resulting text file was perfect. So either Zamzar has some super-duper decrypting system... or perhaps it's just not very difficult.

EDIT: Just found that A-PDF also converted it to text without problems.

A sample document that causes the error would help. Using a debugger to figure out where the error is would help. — Robert Jacobs
The code from codeproject you reference is full of assumptions which sometimes are true and sometimes not. The fact that there is no "copy" option probably indicates that the PDF is encrypted to apply restrictions. It does not look like the codeproject code attempts decryption. So zlib tries to inflate encrypted data which obviously cannot work. A proper way around would be to use a proper PDF library. — mkl
Some of these libraries appear to be very complex to install and get running. I am reluctant to go through all that work without having some indication of the probability of them working. — Mick

plinth · Accepted Answer · 2015-06-03T15:35:51

Streams in a PDF need not be encoded with Flate. They could be encoded with any of the following, among others:

Nothing
LZW
Flate
ASCII85
Crypt (which could be one of several different algorithms)

And (surprise, surprise) any of these methods could also be layered on top of each other!
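As a rough illustration of the point above, a program can look at the /Filter entry of a stream's dictionary before deciding to call inflate(). The sketch below is an assumption-laden example, not code from the original project: `StreamFilters` is a hypothetical helper that takes the dictionary text preceding the "stream" keyword and returns the filter names in the order they are listed. It handles only a direct name or a direct array of names; indirect references, `#xx` name escapes, and nested dictionaries are ignored.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

namespace {
bool IsPdfSpace(char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; }
bool IsPdfDelim(char c) { return std::string("/[]<>()").find(c) != std::string::npos; }
}

// Hypothetical helper: returns the names listed under /Filter in a stream
// dictionary, e.g. {"ASCII85Decode", "FlateDecode"}. Empty if none found.
std::vector<std::string> StreamFilters(const std::string& dict)
{
    std::vector<std::string> filters;
    const std::string key = "/Filter";
    std::size_t pos = 0;
    while ((pos = dict.find(key, pos)) != std::string::npos) {
        std::size_t i = pos + key.size();
        pos = i;
        // Skip longer names that merely start with "/Filter".
        if (i < dict.size() && !IsPdfSpace(dict[i]) && !IsPdfDelim(dict[i]))
            continue;
        while (i < dict.size() && IsPdfSpace(dict[i])) ++i;
        const bool isArray = i < dict.size() && dict[i] == '[';
        if (isArray) ++i;
        while (i < dict.size()) {
            while (i < dict.size() && IsPdfSpace(dict[i])) ++i;
            if (i >= dict.size() || dict[i] != '/') break;   // not a name: stop
            const std::size_t start = ++i;
            while (i < dict.size() && !IsPdfSpace(dict[i]) && !IsPdfDelim(dict[i])) ++i;
            filters.push_back(dict.substr(start, i - start));
            if (!isArray) break;                             // single name only
        }
        return filters;
    }
    return filters;
}
```

With a helper like this, the extraction loop could call inflate() only when the result is exactly `FlateDecode`, and report or skip any stream that uses other filters or a chain of several.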

If there is no copy option, chances are it is encrypted with an owner password and no user password. This allows the author to create access permissions that are supposed to be honored by a reader including:

Modifying the document contents
Copying text/graphics
Adding/editing annotations
Printing
Form filling
Assembling the document (insert, delete pages, creating bookmarks, thumbnails)
High/low quality print
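An encrypted PDF normally references its encryption dictionary through an /Encrypt entry in the trailer (or in the cross-reference stream dictionary). The sketch below is a crude check under that assumption, not a parser: the hypothetical `HasEncryptKey` scans the raw file bytes for the name `/Encrypt`, so it can give a false positive if those bytes happen to appear elsewhere in the file. `ReadWholeFile` is likewise an illustrative helper.

```cpp
#include <cctype>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Illustrative helper: reads a file into memory (empty string on failure).
std::string ReadWholeFile(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Hypothetical check: true if the bytes contain the PDF name /Encrypt
// followed by whitespace, a delimiter, or the end of the data.
bool HasEncryptKey(const std::string& pdfBytes)
{
    const std::string key = "/Encrypt";
    const std::string delims = "/[]<>()";
    std::size_t pos = 0;
    while ((pos = pdfBytes.find(key, pos)) != std::string::npos) {
        pos += key.size();
        if (pos >= pdfBytes.size())
            return true;
        const char next = pdfBytes[pos];
        // Longer names such as /EncryptMetadata do not count.
        if (std::isspace(static_cast<unsigned char>(next)) != 0 ||
            delims.find(next) != std::string::npos)
            return true;
    }
    return false;
}
```

A call such as `HasEncryptKey(ReadWholeFile("sample.pdf"))` (the file name is a placeholder) would let the program report that the document is encrypted instead of passing encrypted bytes to inflate().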

This particular approach to getting text out of a PDF is fraught with error, and I can supply you with a set of documents that your approach won't be able to handle because of font re-encoding, split-up text, oddball locations, form XObjects, unusual transformations, and so on.

To do this properly, you need a better set of tools that aren't blind to the actual format and structure of a PDF document. iText will do this, DotImage will do this.

To give you an idea of the scope of the problem, I wrote the original text search code in Acrobat 1.0 and with all the internal tools available to me, it took me many months to get it right and the code included the ability to find text in unusual, non-rectilinear orientations (think maps), handling ligatures, re-encoding, non-roman fonts, and so on. While I was working on that code, there was another engineer who was dedicated full time for several years writing code called Wordy to do something similar (but more complicated) for full-text extraction and indexing (see this answer for more information about Wordy).
