Pardon the long post, but whoever answers this question will likely need all the information provided.
I have a mostly working implementation of PDF parsing for a project I'm working on. While chasing down an exact text positioning issue in a particular PDF, I found my calculation to be wrong. Here are the relevant snippets of the PDF (produced with qpdf -qdf):
Relevant PDF snippets
%% Page 1
8 0 obj
<<
/Contents [
10 0 R
12 0 R
]
/MediaBox [ 0 0 612 792 ]
/Type /Page
>>
endobj
=====
9 0 obj
<<
/MediaBox [ 0 0 595 842 ]
/Type /Pages
>>
endobj
======
%% Contents for page 1
10 0 obj
<<
/Length 11 0 R
>>
stream
q
.94062 0 0 .94062 26.16627 0 cm
0 0 m
595 0 l
595 842 l
0 842 l
0 0 l
h
W n
q
16.84 16.84 561.32 808.32 re
W n
q
.24 0 0 .24 90.75993 740.44 cm
BT
133 0 0 133 0 0 Tm /TT1.0 1 Tf .0017 Tc [(The)1( )1(Long )1(Tai)1(l)]TJ
ET
I've removed extraneous portions of each snippet.
The following details my calculation of the exact PDF rect for the T in the first word on page 1.
Descriptions and effects of ops in stream of last PDF snippet above
q  : Push graphics state (GS).
     ctm = identity
cm : Modify ctm
           | 0.94062   0        0 |
     ctm = | 0         0.94062  0 |
           | 26.16627  0        1 |
m  : Start path at 0, 0
l  : Line to 595, 0
l  : Line to 595, 842
l  : Line to 0, 842
l  : Line to 0, 0
h  : Close path
     No-op since the path is already closed
W n : Intersect GS clipping path
     GS clipping path wasn't set, so GS clipping path is the closed path that
     defines rect [0 0 595 842]
q  : Push GS
re : Create closed path
     Path defines rect [16.84 16.84 561.32 808.32]
W n : Intersect GS clipping path
     The incoming path is entirely contained in the current GS clipping path,
     so the GS clipping path becomes the incoming path, which defines the rect
     [16.84 16.84 561.32 808.32]
q  : Push GS
cm : Modify ctm
           | 0.941   0      0 |   | 0.24    0       0 |   | 0.226  0       0 |
     ctm = | 0       0.941  0 | * | 0       0.24    0 | = | 0      0.226   0 |
           | 26.167  0      1 |   | 90.760  740.44  1 |   | 97.04  740.44  1 |
BT : Begin text object
Tm : Set text and line matrices
               | 133  0    0 |
     tm = lm = | 0    133  0 |
               | 0    0    1 |
Tf : Set text font and font size
     Set font to TT1, which has ascent=750, descent=-250, and glyph width T=667
     (I will only need the width of T below). Font size is set to 1.
     This has the effect of setting the first component in calculating the text
     rendering matrix to
          | 1  0      0 |
          | 0  1      0 |
          | 0  -0.25  1 |
Tc : Set text char spacing
TJ : Show text
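To make the two cm steps above easier to check, here is a minimal Python sketch that reproduces the ctm product with the unrounded operands from the content stream. The helper names matmul and pdf_matrix are made up for illustration, and the operand order (current ctm on the left, new cm operand on the right) simply mirrors what is written in this post; it is not a claim about what the spec requires.

```python
def matmul(a, b):
    """Multiply two 3x3 matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def pdf_matrix(a, b, c, d, e, f):
    """Expand the six PDF operands a b c d e f into a 3x3 row-vector matrix."""
    return [[a, b, 0.0], [c, d, 0.0], [e, f, 1.0]]

# Operands of the two cm operators in the content stream.
first_cm = pdf_matrix(0.94062, 0, 0, 0.94062, 26.16627, 0)
second_cm = pdf_matrix(0.24, 0, 0, 0.24, 90.75993, 740.44)

# Operand order copied from the post: existing ctm * new cm operand.
ctm = matmul(first_cm, second_cm)

for row in ctm:
    print([round(v, 3) for v in row])
```

The last row of the result carries the translation, which is where the 97.04 and 740.44 used in the rest of the calculation come from.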
Calculation of PDF rect for T
Our target text for the calculation is the first text displayed, the letter T (hence the only glyph width provided above). To perform the text-showing operator TJ, we first calculate Trm as specified in section 9.4.4 of the PDF spec (ISO 32000-1:2008):
a  = Tfs * Th = 1 * 1 = 1
d  = Tfs = 1
ty = Trise = -250/1000 = -0.25

      | a  0   0 |
Trm = | 0  d   0 | * Tm * ctm
      | 0  ty  1 |

      | 1  0      0 |   | 133  0    0 |   | 0.226  0       0 |
Trm = | 0  1      0 | * | 0    133  0 | * | 0      0.226   0 |
      | 0  -0.25  1 |   | 0    0    1 |   | 97.04  740.44  1 |

      | 30.06  0       0 |
    = | 0      30.06   0 |
      | 97.04  732.93  1 |
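The Trm product above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic as I have written it: it uses the rounded ctm values from the post, it takes Trise = -0.25 (my descent/1000 assumption, not a value set in the content stream), and matmul is a hypothetical helper.

```python
def matmul(a, b):
    """Multiply two 3x3 matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

tfs, th = 1.0, 1.0   # font size and horizontal scaling
trise = -0.25        # assumption from the post: descent / 1000

font_part = [[tfs * th, 0.0, 0.0],
             [0.0, tfs, 0.0],
             [0.0, trise, 1.0]]
tm = [[133.0, 0.0, 0.0],
      [0.0, 133.0, 0.0],
      [0.0, 0.0, 1.0]]
# Rounded ctm as quoted in the post.
ctm = [[0.226, 0.0, 0.0],
       [0.0, 0.226, 0.0],
       [97.04, 740.44, 1.0]]

# Trm = font_part * Tm * ctm, evaluated left to right.
trm = matmul(matmul(font_part, tm), ctm)

for row in trm:
    print(row)
```

With the rounded 0.226 the scale comes out near 30.06; with the unrounded 0.94062 * 0.24 it is closer to 30.02, which is why both numbers appear in this post.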
The glyph width (horizontal displacement) and glyph height are also calculated as per section 9.4.4:
width = width of 'T' / 1000 = 0.667
height = (ascent - descent) / 1000 = (750 + 250) / 1000 = 1
The text rect for the letter T is:
textRect for 'T' = [ 0, descent/1000, width, height ]
                 = [ 0, -0.25, 0.667, 1 ]
The T is rendered at the text rect transformed by the ctm. Calculating the resulting PDF rect for T:
PDF rect : 97.04 732.93 20.03 30.02
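For reference, this is roughly how I get from Trm and the font metrics to that rect. It is a sketch under assumptions: glyph_rect is a hypothetical helper, it only handles a Trm with no rotation or skew, and it takes the rect origin straight from the translation row of Trm, as in my calculation above.

```python
def glyph_rect(trm, glyph_width_units, ascent, descent):
    """Return (x, y, width, height) in page units for a glyph at the text origin.

    Assumes trm has no rotation or skew, so only the diagonal scale terms
    and the translation row are used.
    """
    width = glyph_width_units / 1000.0
    height = (ascent - descent) / 1000.0
    x, y = trm[2][0], trm[2][1]
    return (x, y, width * trm[0][0], height * trm[1][1])

# Unrounded scale: Tm scale * second cm scale * first cm scale.
scale = 133 * 0.24 * 0.94062
trm = [[scale, 0.0, 0.0],
       [0.0, scale, 0.0],
       [97.04, 732.93, 1.0]]  # translation row as computed in the post

rect = glyph_rect(trm, 667, 750, -250)
print([round(v, 2) for v in rect])  # [97.04, 732.93, 20.03, 30.02]
```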
I have created highlight annotations for the T in both Acrobat and Preview and used that information to determine the calculation of the PDF position of T in both those programs as:
Acrobat : 111.54 718.99 20.03 30.02
Preview : 111.54 718.99 20.03 30.02
Compared to Acrobat and Preview, I am off by dx = 97.04 - 111.54 = -14.5 and dy = 732.93 - 718.99 = 13.94; that is, my rect is too far left and too high relative to the actual position of the T.
For PDFs that do not have a changed Media Box or do not declare a graphics state clipping path, all my calculations are spot on. I suspect the error is related either to the differing Media Box declarations in PDF objects 8 0 and 9 0 or, more probably, to the graphics state clipping path resulting from the m, l, h, W n, and re operators, which produced a clipping path rect of
[ 16.84 16.84 561.32 808.32 ]
while the Media Box was
[ 0 0 595 842 ]
I can't find any section in the PDF spec that indicates a change in my calculations due to the graphics state clipping path (again, assuming that is the culprit).
Ugh. What am I missing?