3
votes

I'm trying to write a PDF parser in C# but I've run into an issue where I'm unsure how to interpret the specification.

Unless otherwise specified, one unit of user space in a PDF document is 1/72 of an inch (i.e. 1pt).

The size operand of the Tf operator scales the font from the standard size (generally 1 unit of user space / 1pt) to the correct display size.

I have the following page content:

1 0 0 -1 0 792 cm
q
0 0 612 792 re
W* n
q
.75 0 0 .75 0 0 cm
1 1 1 RG 1 1 1 rg
/G0 gs
0 0 816 1056 re
f
0 0 816 1056 re
f
0 0 816 1056 re
f
Q
Q
q
0 0 612 791.25 re
W* n
q
.75 0 0 .75 0 0 cm
1 1 1 RG 1 1 1 rg
/G0 gs
0 0 816 1055 re
f
0 96 816 960 re
f
0 0 0 RG 0 0 0 rg
BT
/F0 21.33 Tf
1 0 0 -1 0 140 Tm
96 0 Td <0037> Tj
13.0280762 0 Td <004B> Tj
11.8616943 0 Td <004C> Tj
4.7384338 0 Td <0056> Tj
ET
BT
/F1 21.33 Tf
1 0 0 -1 0 140 Tm
136.292267 0 Td <0001> Tj
ET
...

I know that the font size of the two text operations in the sample is 16pt; however, the Tf operator is using a size of 21.33. In order to convert from this font size back to points I was intending to use the scale (y) of the cm operator making the point size:

21.33 * 0.75 = 15.9975

However, I could find nothing in the PDF specification supporting this conversion, and none of the libraries I checked (PDFBox, iTextSharp, Spire PDF) listed the font size as anything but 21.33.

Should I use the CTM (as defined by the cm operator) to scale the font size back to the correct size, or is the match just pure chance?

The pdf file is here: https://github.com/UglyToad/PdfPig/blob/master/src/UglyToad.PdfPig.Tests/Integration/Documents/Single%20Page%20Simple%20-%20from%20google%20drive.pdf

1
cm operations concatenate to each other, so yes, the factor 0.75 in the first scale operation is still "valid" when the Tf operator is processed. It's not really a conversion; all graphic operations are done using matrices. - Jongware
Is there a way to represent the scaling of the font size as a matrix operation? Since it's a scalar value, it's not possible to multiply it by the matrix. In the example in the question the value of scaleX = 0.75 and the value of scaleY = -0.75 (negative), so it only makes sense to multiply by the X scale, but I can't figure out the justification for doing so. - Underscore
This is how we ended up calculating point size given the various transformation matrices: github.com/UglyToad/PdfPig/blob/master/src/UglyToad.PdfPig/… I'm still not 100% sure it's correct in every scenario but it seems 'close enough' for most purposes. - Underscore

1 Answer

3
votes

First of all, your comparison with other text extractors is based on a misunderstanding:

none of the libraries I checked (PDFBox, iTextSharp, Spire PDF) listed the font size as anything but 21.33.

The "font size" parameter returned by all those libraries is simply the size argument of the Tf instruction, not the effective font size you observe in the final document, which is what you are trying to determine. So your comparison with other libraries does not make sense.


Now, concerning your approach:

In order to convert from this font size back to points I was intending to use the scale (y) of the cm operator making the point size:

21.33 * 0.75 = 15.9975

While some libraries call it so, calling the fourth cm parameter "scale (y)" is misleading. E.g. for text rotated by 90° that parameter is usually zero, while the rendered glyphs are obviously not reduced to zero height.

Thus, merely using the "scale (y)" parameter does not work; you have to take the whole transformation into account.
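To illustrate what taking the whole transformation into account can look like, here is a minimal Python sketch (easy to port to C#). It assumes the PDF convention of row vectors and matrices written as (a, b, c, d, e, f), concatenates the two cm matrices from the question with the text matrix, and measures the length of the transformed vector (0, size). The helper names `multiply` and `effective_font_size` are made up for this example and do not come from any library; horizontal scaling and text rise are ignored.

```python
import math

def multiply(m, n):
    """Return m x n for PDF matrices (a, b, c, d, e, f), row-vector convention."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )

def effective_font_size(tf_size, text_matrix, ctm):
    """Length of the vector (0, tf_size) after Tm x CTM (translation is irrelevant)."""
    _, _, c, d, _, _ = multiply(text_matrix, ctm)
    return math.hypot(tf_size * c, tf_size * d)

# Values from the content stream in the question.
# Each cm operand is premultiplied onto the existing CTM.
ctm = multiply((0.75, 0, 0, 0.75, 0, 0), (1, 0, 0, -1, 0, 792))
tm = (1, 0, 0, -1, 0, 140)
sample_size = effective_font_size(21.33, tm, ctm)   # approximately 15.9975

# Hypothetical rotated page: CTM rotates by 90 degrees and scales by 0.75.
# Its fourth parameter ("scale (y)") is 0, yet the text is not zero height.
rotated_ctm = (0.0, 0.75, -0.75, 0.0, 0, 0)
rotated_size = effective_font_size(21.33, (1, 0, 0, 1, 0, 0), rotated_ctm)

print(sample_size, rotated_size)
```

Both cases come out at roughly 16pt, whereas multiplying by the fourth parameter alone would give 0 for the rotated case.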


Finally, let's discuss what you are actually after.

As long as the combined transformation matrix (current transformation matrix, text matrix, and horizontal scaling) is orthogonal and the text lines follow this orthogonality, the meaning of your notion of font size is fairly obvious.

But as soon as there is shearing in that combined matrix, the meaning of "font size" is no longer obvious.

  • You might mean the length of what an originally vertical line (one unit high) is transformed into.
  • You might mean the length of the projection of that transformed line onto a line at a right angle to the transformed font base line.
  • Or you might mean the length of the projection of that transformed line onto a line at a right angle to an observed base line.

The first two numbers are trivial to calculate using simple linear algebra. The third may be more difficult because you have to determine the base line as observed by humans in the resulting PDF. With innovative use of transformations this might be non-trivial.
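As a rough sketch of the first two numbers, assuming `m` is the already combined matrix (text matrix times CTM) in the (a, b, c, d, e, f) form and that the untransformed base line runs along the x axis, they could be computed as follows. The function name `sheared_heights` is hypothetical, chosen only for this example.

```python
import math

def sheared_heights(size, m):
    """Return (length of the transformed vertical, its projection onto the
    normal of the transformed base line) for a combined matrix m."""
    a, b, c, d = m[:4]
    up = (size * c, size * d)       # image of the vertical vector (0, size)
    baseline = (a, b)               # image of the base line direction (1, 0)
    length = math.hypot(*up)
    # |baseline x up| / |baseline| is the component of `up` perpendicular
    # to the transformed base line.
    cross = baseline[0] * up[1] - baseline[1] * up[0]
    perpendicular = abs(cross) / math.hypot(*baseline)
    return length, perpendicular

# Pure scaling: both notions agree.
plain = sheared_heights(10, (0.75, 0, 0, 0.75, 0, 0))

# Shear (x' = x + y, y' = y): the transformed vertical gets longer,
# but its height above the base line stays the same.
sheared = sheared_heights(10, (1, 0, 1, 1, 0, 0))

print(plain, sheared)
```

Without shearing the two values coincide; with shearing they differ, which is exactly why "font size" needs a definition before it can be computed.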