3
votes

I am trying to implement full-text search using Quartz 2D, but it's a nightmare. I can "extract" text from a PDF page using PDF operator callbacks (TJ and others):

    CGPDFOperatorTableRef myTable;

    myTable = CGPDFOperatorTableCreate();

    CGPDFOperatorTableSetCallback (myTable, "BT", &op_BT);
    CGPDFOperatorTableSetCallback (myTable, "Td", &op_Td);
    CGPDFOperatorTableSetCallback (myTable, "TD", &op_TD);
    CGPDFOperatorTableSetCallback (myTable, "Tm", &op_Tm);
    CGPDFOperatorTableSetCallback (myTable, "T*", &op_T);
    CGPDFOperatorTableSetCallback (myTable, "TJ", &op_TJ);
    CGPDFOperatorTableSetCallback (myTable, "Tf", &op_TF);
    CGPDFOperatorTableSetCallback (myTable, "ET", &op_ET);

But at the same time I need to highlight a match on the PDF page with a rectangle, as Safari does, for example. Any suggestions on how to implement this? Are there any solutions that don't require such an immense amount of work?

1 Answer

4
votes

This is only the tip of the iceberg...

Detecting the "bytes" encoded in a TJ does not mean that you already have "text", or that you are able to convert them back to text at all.

In PDF, text is drawn with an "active" font (set by Tf). The font has an encoding. There are a lot of different encodings around, and some are not "invertible" in the sense that you can get Unicode back from them.

If you have an "invertible" encoding, that's fine. It is still a lot of work to implement the reverse lookup (especially for the multi-byte encodings), but one fine day you're done.

If your encoding is not so smart, you may still have an additional /ToUnicode map that lets you compute the Unicode value. It's additional effort, but now you're fine.
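To give an idea of what the /ToUnicode step boils down to once the map is parsed, here is a minimal sketch in C. It assumes the CMap stream has already been decoded into an array of single-code entries sorted by code (roughly the `bfchar` case). The names `ToUnicodeEntry` and `to_unicode_lookup` are made up for illustration and are not part of Quartz. Real maps also use `bfrange` entries and can map one code to several code points, which this ignores.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, already-parsed /ToUnicode entry: one character code
   from the content stream mapped to one Unicode code point. */
typedef struct {
    uint16_t code;     /* character code as it appears in the TJ string */
    uint32_t unicode;  /* Unicode code point it stands for */
} ToUnicodeEntry;

/* Binary search over entries sorted by ascending code.
   Returns 1 and stores the code point in *out when a mapping exists,
   0 when the code has no entry (nothing can be recovered from this map). */
int to_unicode_lookup(const ToUnicodeEntry *map, size_t count,
                      uint16_t code, uint32_t *out)
{
    size_t lo = 0, hi = count;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (map[mid].code == code) {
            *out = map[mid].unicode;
            return 1;
        }
        if (map[mid].code < code)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;
}
```

You would call this once per character code pulled out of a TJ string, using the map that belongs to the font selected by the last Tf. A return value of 0 is the "neither mapping" case: the glyph gets drawn, but you cannot say which character it is.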

...apart from the many existing documents that support neither of these mappings to Unicode...

...and after all: PDF does not contain "text" in that sense, it draws characters. So in theory you must draw the characters onto a virtual page before you can sort them into any readable order...
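The "virtual page" idea can be sketched roughly like this in C, assuming you have already collected every glyph with its baseline origin in page space (from Tm, Td, TD and T*) and that the text is horizontal, left to right, in a single column. `PlacedGlyph`, `sort_reading_order` and the `line_gap` tolerance are hypothetical names for illustration, not Quartz API. Rotated text, multiple columns and right-to-left scripts need more than this.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record for one glyph drawn on the page. */
typedef struct {
    double   x, y;     /* baseline origin in page space, y grows upward */
    uint32_t unicode;  /* code point, once the encoding has been resolved */
    int      line;     /* line index, filled in by sort_reading_order */
} PlacedGlyph;

/* Higher y first, so the top of the page comes first. */
static int by_y_desc(const void *a, const void *b)
{
    const PlacedGlyph *ga = a, *gb = b;
    return (ga->y < gb->y) - (ga->y > gb->y);
}

/* Line index first, then left to right within the line. */
static int by_line_then_x(const void *a, const void *b)
{
    const PlacedGlyph *ga = a, *gb = b;
    if (ga->line != gb->line)
        return (ga->line > gb->line) - (ga->line < gb->line);
    return (ga->x > gb->x) - (ga->x < gb->x);
}

/* Sort glyphs into a simple reading order.
   Glyphs whose baselines are at most line_gap apart (after sorting by y)
   are treated as belonging to the same line. */
void sort_reading_order(PlacedGlyph *g, size_t n, double line_gap)
{
    int line = 0;
    size_t i;

    if (n == 0)
        return;

    qsort(g, n, sizeof *g, by_y_desc);

    g[0].line = 0;
    for (i = 1; i < n; i++) {
        if (g[i - 1].y - g[i].y > line_gap)
            line++;            /* big vertical jump: start a new line */
        g[i].line = line;
    }

    qsort(g, n, sizeof *g, by_line_then_x);
}
```

Lines are assigned in a separate pass rather than inside the comparator because a "roughly equal y" comparison is not transitive, and `qsort` needs a consistent ordering. Once the glyphs are in order you can search the resulting string, and the stored positions of the matching glyphs are the starting point for a highlight rectangle.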

All in all, it's a lot of fun.