Pardon the long post, but whoever answers this question will undoubtedly need all the information provided.

I have a somewhat successful implementation of PDF parsing for a project I'm working on. In chasing down an issue in exact text positioning in a particular PDF, I found my calculation to be wrong. Here are the relevant snippets of the PDF (using qpdf -qdf):

Relevant PDF snippets

%% Page 1
8 0 obj
<<
  /Contents [
    10 0 R
    12 0 R
  ]
  /MediaBox [ 0 0 612 792 ]
  /Type /Page
>>
endobj

=====

9 0 obj
<<
  /MediaBox [ 0 0 595 842 ]
  /Type /Pages
>>
endobj

======

%% Contents for page 1
10 0 obj
<<
  /Length 11 0 R
>>
stream
q
.94062 0 0 .94062 26.16627 0 cm
0 0 m
595 0 l
595 842 l
0 842 l
0 0 l
h
W n
q
16.84 16.84 561.32 808.32 re
W n
q
.24 0 0 .24 90.75993 740.44 cm
BT
133 0 0 133 0 0 Tm /TT1.0 1 Tf .0017 Tc [(The)1( )1(Long )1(Tai)1(l)]TJ
ET

I've removed extraneous portions of each snippet.

The following details my calculation of the exact PDF rect for the T in the first word on page 1.

Descriptions and effects of ops in stream of last PDF snippet above

q : Push graphics state (GS).

   ctm = identity

cm : Modify ctm

         |  0.94062   0        0 |
   ctm = |  0         0.94062  0 |
         | 26.16627   0        1 |

m : Start path at 0, 0

l : Line to 595, 0

l : Line to 595, 842

l : Line to 0, 842

l : Line to 0, 0

h : Close path

   No op since path already closed

W n : Intersect GS clipping path (W), then end the path without painting (n)

   GS clipping path wasn't set, so GS clipping path is the closed path that
   defines rect [0 0 595 842]

q : Push GS

re : Create closed path

  Path defines rect [16.84 16.84 561.32 808.32]

W n : Intersect GS clipping path (W), then end the path without painting (n)

   The incoming path is totally contained in current GS clipping path,
   so the GS clipping path becomes the incoming path, which defines the rect
   [16.84 16.84 561.32 808.32]

q : Push GS

cm : Modify ctm

         |  0.941   0      0 |   | 0.24     0    0 |   | 0.226   0     0 |
   ctm = |   0      0.941  0 | * | 0        0.24 0 | = | 0       0.226 0 |
         |  26.167  0      1 |   | 90.756 740.44 1 |   | 97.04 740.44  1 |

BT : Begin text object

Tm : Set text and line matrices

             | 133   0 0 |
   tm = lm = |   0 133 0 |
             |   0   0 1 |

Tf : Set text font and font size

   Set font to TT1, which has ascent=750, descent=-250, and glyph width T=667
   (I will only need the width of T below). Font size is set to 1.

   This has the effect of setting the first component in calculating the text
   rendering matrix to

      | 1   0    0 |
      | 0   1    0 |
      | 0  -0.25 1 |

Tc : Set text char spacing

TJ : Show text

Calculation of PDF rect for T

Our target for the calculation is the first glyph shown, the letter T (hence the only glyph width provided above). To perform the text-showing operator TJ, we first calculate Trm as specified in section 9.4.4 of the PDF spec (PDF 32000-1:2008):

  a = Tfs * Th = 1 * 1 = 1
  d = Tfs = 1
  ty = Trise = -250/1000 = -0.25

         |  a   0  0 |   
   Trm = |  0   d  0 | * Tm * ctm
         |  0  ty  1 |   

         | 1   0    0 |   | 133   0  0 |   | 0.226   0     0 |
   Trm = | 0   1    0 | * |   0 133  0 | * | 0       0.226 0 |
         | 0  -0.25 1 |   |   0   0  1 |   | 97.04 740.44  1 |

         | 30.06   0     0 |
       = |  0     30.06  0 |
         | 97.04 732.93  1 |
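
For reference, the product above can be checked with a short Python sketch. The helper names are hypothetical, and the inputs are simply the rounded values from the calculation above, including the assumption that ty is -0.25 and that the ctm is the one computed in the previous step.

```python
# Sketch of Trm = [[Tfs*Th, 0, 0], [0, Tfs, 0], [0, Trise, 1]] x Tm x CTM
# (PDF 32000-1:2008, section 9.4.4). Helper names are illustrative only.

def multiply(m1, m2):
    """Return the 3x3 matrix product m1 x m2."""
    return [[sum(m1[i][k] * m2[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def text_rendering_matrix(tfs, th, trise, tm, ctm):
    """Combine the text state parameters with Tm and the CTM."""
    params = [[tfs * th, 0.0, 0.0],
              [0.0, tfs, 0.0],
              [0.0, trise, 1.0]]
    return multiply(multiply(params, tm), ctm)

# Values as used in the calculation above (rounded ctm, ty assumed -0.25).
tm = [[133.0, 0.0, 0.0], [0.0, 133.0, 0.0], [0.0, 0.0, 1.0]]
ctm = [[0.226, 0.0, 0.0], [0.0, 0.226, 0.0], [97.04, 740.44, 1.0]]

trm = text_rendering_matrix(1.0, 1.0, -0.25, tm, ctm)
for row in trm:
    print([round(v, 2) for v in row])
```

Taking the ctm as a parameter keeps the Trm step separate from how the ctm itself was derived, so the same function can be reused with a different ctm.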

The glyph width (horizontal displacement) and glyph height are also calculated as per section 9.4.4:

  width  = width of 'T' / 1000 = 0.667
  height = (ascent - descent) / 1000 = (750 + 250) / 1000 = 1

The text rect for the letter T is:

  textRect for 'T' = [ 0 descent/1000, width, height ]
                   = [ 0 -0.25, 0.667, 1 ]

The T is rendered at the text rect transformed by the ctm. Calculating the resulting PDF rect for T:

PDF rect : 97.04 732.93 20.03 30.02

I created highlight annotations for the T in both Acrobat and Preview and used them to determine where each program places the T:

Acrobat : 111.54 718.99 20.03 30.02

Preview : 111.54 718.99 20.03 30.02

Compared to Acrobat and Preview, I have dx = -14.5 and dy = 13.31; that is, my rect is too far left and too high relative to the actual position of the T.

For PDFs that do not have a changed Media Box or do not declare a graphics state clipping path, all my calculations are spot on. I suspect the problem is related to either the differing Media Box declarations in PDF objects 8 0 and 9 0 or, more probably, the graphics state clipping path resulting from the m, l, h, W n, and re operators, which resulted in a clipping path rect of

[ 16.84 16.84 561.32 808.32 ]

while the Media Box was

[ 0 0 595 842 ]

I can't find any section in the PDF spec that indicates a change in my calculations due to the graphics state clipping path (again, assuming that is the culprit).

Ugh. What am I missing?

1 Answer

An error is here:

cm : Modify ctm

         |  0.941   0      0 |   | 0.24     0    0 |   | 0.226   0     0 |
   ctm = |   0      0.941  0 | * | 0        0.24 0 | = | 0       0.226 0 |
         |  26.167  0      1 |   | 90.756 740.44 1 |   | 97.04 740.44  1 |

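You multiply the new matrix onto the existing transformation from the right, but it has to be multiplied from the left. With the row-vector convention the PDF spec uses, the cm operator sets ctm' = M * ctm, where M is the matrix given by the cm operands, not ctm' = ctm * M.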
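
As a minimal sketch of the difference, the following Python snippet concatenates the two cm matrices from the content stream in both orders. The helper names (`matrix`, `multiply`, `apply_cm`) are made up for illustration and do not come from any PDF library; only the operand values are taken from the question.

```python
# Sketch: concatenating `cm` matrices onto the CTM (row-vector convention).
# Helper names are illustrative, not from any PDF library.

def matrix(a, b, c, d, e, f):
    """Expand the six PDF operands a b c d e f into a 3x3 matrix."""
    return [[a, b, 0.0], [c, d, 0.0], [e, f, 1.0]]

def multiply(m1, m2):
    """Return the 3x3 matrix product m1 x m2."""
    return [[sum(m1[i][k] * m2[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def apply_cm(ctm, cm):
    """New CTM after a `cm` operator: the new matrix goes on the LEFT."""
    return multiply(cm, ctm)

identity = matrix(1, 0, 0, 1, 0, 0)
first_cm = matrix(0.94062, 0, 0, 0.94062, 26.16627, 0)
second_cm = matrix(0.24, 0, 0, 0.24, 90.75993, 740.44)

# Order required by the spec: each new matrix is pre-multiplied.
ctm = apply_cm(apply_cm(identity, first_cm), second_cm)

# Order used in the question: the new matrix is post-multiplied.
wrong = multiply(first_cm, second_cm)

print("left  (correct):", round(ctm[2][0], 2), round(ctm[2][1], 2))
print("right (question):", round(wrong[2][0], 2), round(wrong[2][1], 2))
```

The scale factors come out the same in both orders because both matrices are uniform scalings; only the translation row differs. With the matrix applied from the left, the x translation is about 111.54, which agrees with the x position reported by Acrobat and Preview, whereas the order used in the question gives 97.04.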