7
votes

I have been trying to build a search feature in a PDF application. I read the Quartz 2D guide in the iPhone reference library, and it says a lot about "PDF operators". Apparently everything is done through them, by registering callbacks for them.

For information about PDF operators we are told to read Adobe's PDF reference, but it is vast. Can anyone give me an idea of what these operators are (or how to approach studying them) and which of them I will need for a "search for a string in a PDF" feature?

3 Answers

7
votes

I've been searching for the same thing and today I found this post that has some clues:

http://www.random-ideas.net/posts/42

Looks like the operators are "TJ" and "Tj".

6
votes

Don't be scared off by the PDF reference. It's very well laid out, and you really only need to read a few chapters to understand how text is handled. You can download it from Adobe.

Enrique is correct in that TJ and Tj are the operators that show text, but it is entirely possible, and even normal, for words and sentences to be split up across multiple operations. You should probably concentrate on text blocks, marked by BT and ET (begin text / end text) in the PDF Stream Object.
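Because a word can be split across several Tj/TJ calls, one way to handle it is to collect every fragment shown between BT and ET and search the joined text. Here is a minimal sketch of that idea in plain C. The `TextBlock` type and the `block_*` functions are hypothetical names made up for illustration, not part of Quartz, and it assumes the strings have already been decoded to a single-byte, ASCII-compatible encoding, which real PDF fonts do not guarantee.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical accumulator for the text shown inside one BT ... ET block. */
typedef struct {
    char   text[1024];  /* concatenated fragments, always NUL-terminated */
    size_t len;         /* bytes used, not counting the terminator */
} TextBlock;

/* Call when BT is seen: start a fresh block. */
static void block_begin(TextBlock *b) {
    b->len = 0;
    b->text[0] = '\0';
}

/* Call for each string shown by Tj, and for each string element of a TJ array.
   A fragment that does not fit is truncated instead of overflowing the buffer. */
static void block_append(TextBlock *b, const char *frag, size_t n) {
    size_t room = sizeof b->text - 1 - b->len;
    if (n > room) {
        n = room;
    }
    memcpy(b->text + b->len, frag, n);
    b->len += n;
    b->text[b->len] = '\0';
}

/* Call when ET is seen: returns 1 if needle occurs in the joined text. */
static int block_contains(const TextBlock *b, const char *needle) {
    return strstr(b->text, needle) != NULL;
}
```

In your operator callbacks you would call `block_begin` for BT, `block_append` for each shown string, and `block_contains` at ET. This still misses matches that span two text blocks, so you may want to keep a buffer per page instead.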

PDFBox from the Apache project is a very full-featured library for working with PDF documents. Have a look there.

2
votes

There are four operators that show text, namely Tj, ', " and TJ. When you set up your operator table you must escape at least the double quotation mark, like so:

CGPDFOperatorTableSetCallback(table, "\"", doubleQuot);

I did the same thing for the single quotation mark as well, just to be sure.

If you carefully read chapter "9.4.3 Text-Showing Operators" of the reference document purecharger linked to, you will see that the quotation mark operators are actually shorthand for sequences of simpler operators ending in Tj. You must still register callbacks for them, though, or you might miss some text.
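To make the decomposition concrete, here is a small sketch in plain C of how the two quotation mark operators reduce to T*, Tw, Tc and Tj. The `TextState` struct and the `op_*` function names are hypothetical, invented for this illustration. They are not Quartz callbacks, and the state is deliberately simplified (a line counter instead of real positioning).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical, heavily simplified text state. */
typedef struct {
    double word_spacing;  /* set by Tw */
    double char_spacing;  /* set by Tc */
    int    line;          /* number of T* line moves so far */
    char   shown[256];    /* everything shown so far, NUL-terminated */
} TextState;

static void op_Tw(TextState *s, double aw) { s->word_spacing = aw; }
static void op_Tc(TextState *s, double ac) { s->char_spacing = ac; }
static void op_Tstar(TextState *s)         { s->line++; }

/* string Tj: show a string (appended here; truncated if the buffer is full). */
static void op_Tj(TextState *s, const char *str) {
    size_t used = strlen(s->shown);
    snprintf(s->shown + used, sizeof s->shown - used, "%s", str);
}

/* string '  is equivalent to:  T*  string Tj */
static void op_quote(TextState *s, const char *str) {
    op_Tstar(s);
    op_Tj(s, str);
}

/* aw ac string "  is equivalent to:  aw Tw  ac Tc  string ' */
static void op_dquote(TextState *s, double aw, double ac, const char *str) {
    op_Tw(s, aw);
    op_Tc(s, ac);
    op_quote(s, str);
}
```

Start from a zero-initialised state (`TextState s = {0};`). For a plain string search only the text passed on to `op_Tj` matters, which is why forwarding both quotation mark operators to your Tj handler is usually enough.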

All of these operators always appear inside a BT ... ET text object. You already noticed that the BT operator itself does not take any parameters, but if you keep track of the text matrix (only needed if you want positioning), you should reset it to the identity matrix whenever you encounter BT.
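If you do want positions, the bookkeeping described above might look roughly like this. It is only a sketch in plain C: `Matrix`, `TextObject` and the `op_*` names are hypothetical, and Td is the only positioning operator handled (a real implementation would also need at least Tm, TD and T*).

```c
/* A PDF matrix [a b c d e f], using the row-vector convention of the spec. */
typedef struct { double a, b, c, d, e, f; } Matrix;

/* Hypothetical tracker for one text object. */
typedef struct {
    Matrix tm;       /* text matrix */
    Matrix tlm;      /* text line matrix */
    int    in_text;  /* 1 between BT and ET */
} TextObject;

static const Matrix kIdentity = { 1.0, 0.0, 0.0, 1.0, 0.0, 0.0 };

/* BT: takes no operands; both matrices start over as the identity. */
static void op_BT(TextObject *t) {
    t->tm = kIdentity;
    t->tlm = kIdentity;
    t->in_text = 1;
}

/* ET: end of the text object. */
static void op_ET(TextObject *t) {
    t->in_text = 0;
}

/* tx ty Td: Tlm = [1 0 0 1 tx ty] x Tlm, then Tm = Tlm. */
static void op_Td(TextObject *t, double tx, double ty) {
    t->tlm.e += tx * t->tlm.a + ty * t->tlm.c;
    t->tlm.f += tx * t->tlm.b + ty * t->tlm.d;
    t->tm = t->tlm;
}
```

After `op_BT`, a `72 700 Td` followed by `0 -14 Td` leaves the text matrix translated to (72, 686), and the next BT puts it back at the origin.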