Precise bounding box for glyphs in a PDF?

Question

I'm trying to calculate the exact bounding box of every text glyph in a vector PDF.

This involves keeping track of the CTM, the drawing and positioning operators, etc., but also calculating the boundaries of each specific glyph in "glyph space" (using the information from the glyf tables in the embedded fonts).

I realize the PDF FontDescriptor includes a rough bounding box for each embedded font, but that's a composite of all glyphs in the font -- i.e., the smallest bounding box that fits all glyphs in the font. For my purposes, I need more precise positioning.

My specific application is extracting the musical semantics from a vector PDF of sheet music. As such, one nice constraint is that I can assume glyphs aren't drawn together in the same Tj/TJ operator. Each glyph is drawn independently.

Also, note that I'm defining bounding box as "the smallest box that can contain all the drawn parts of the glyph." There's no need to ignore the ascenders/descenders/etc. that might be considered "outside" the bounding box in other applications.

There are many moving parts here, and I've found it's quite hard to debug. So here's what I'd love help with:

This example PDF I've created has 10 glyphs. What is the "ground truth" bounding box positioning for these 10 glyphs, in device space? My current code produces the following, but it's incorrect. I know it's incorrect because it says the first glyph ("&") horizontally intersects the second ("\u02d9"), which you can see isn't true when you view the PDF in a PDF reader.

'&'      ( 57.2799755477664, 600.7092061684704,  86.7452642315424, 677.1570718099680)
'\u02d9' ( 82.0030393188000, 633.6851606704608,  96.3090818379936, 644.6969866323168)
'\u0153' (144.7841941848000, 623.9630080194528, 158.6735558539200, 634.5581702962656)
'\u0153' (181.6778111184000, 619.0027260546528, 195.5671727875200, 629.5978883314656)
'w'      (226.1671727148000, 611.3638918288608, 245.0765465300448, 622.3161944071392)
'w'      (320.1063822180000, 631.2050196880608, 339.0157560332448, 642.1573222663392)
'\u0153' (414.0455917212000, 641.3239948962528, 427.9349533903200, 651.9191571730656)
'\u0153' (450.9392086548000, 636.3637129314528, 464.8285703239200, 646.9588752082656)
'\u0153' (487.9878407856000, 631.4034309666528, 501.8772024547200, 641.9985932434656)
'\u0153' (524.8814577192000, 628.9232899842528, 538.7708193883200, 639.5184522610656)

How did you calculate those positions? (I realize this is a lot to ask, given the complexity of PDF.) It would be a huge help to have a walkthrough, and I'm sure it would help others in the future.
Is there a tool that does this off the shelf?

I'm afraid (a) the description in the PDF specification is already quite good. You may want to ask specific questions or share your (tidied-up) code for analysis instead of waiting for someone to reformulate the spec. And (b) your starting positions don't look too far off (differences might be due to a differing target coordinate system or chosen starting point), but the rectangles you span from them seem very odd. — mkl

KenS · Accepted Answer · 2015-05-04T11:05:41

I believe the only way to get truly accurate information is to actually render the glyphs at the given point size and collect the extents of the resulting bitmap.

Even extracting the path describing the glyph won't give you completely accurate information, because hinting can subtly (or in some cases, not so subtly) alter the way the glyph is rendered. In any event, extracting the path is as much work as rendering the bitmap, possibly more.
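If you do settle for the unhinted outline, the arithmetic looks roughly like the following. This is only a minimal sketch: it assumes you already have the glyph's outline extents in font units (for example from the glyf table) and the font's units per em, and it builds the text rendering matrix the way the PDF specification describes it, [Tfs*Th 0 0 Tfs 0 Trise] x Tm x CTM. The function names and the sample numbers are made up for illustration; they are not taken from the example PDF.

```python
from typing import Tuple

# A PDF matrix in the usual operand order: a b c d e f
Matrix = Tuple[float, float, float, float, float, float]


def concat(m: Matrix, n: Matrix) -> Matrix:
    """Return m x n, i.e. apply m first and then n (PDF row-vector convention)."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def apply(m: Matrix, x: float, y: float) -> Tuple[float, float]:
    """Transform the point (x, y) by the matrix m."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


def glyph_bbox_in_device_space(glyph_bbox, units_per_em, font_size,
                               text_matrix, ctm, h_scale=1.0, rise=0.0):
    """Map an outline bbox given in font units to an axis-aligned device-space box.

    glyph_bbox is (xmin, ymin, xmax, ymax) of the unhinted outline.
    """
    unit = 1.0 / units_per_em  # font units -> text space units
    font_params = (font_size * h_scale * unit, 0.0, 0.0, font_size * unit, 0.0, rise)
    trm = concat(concat(font_params, text_matrix), ctm)
    xmin, ymin, xmax, ymax = glyph_bbox
    # Transform all four corners so that rotated or skewed text is handled too.
    corners = [apply(trm, x, y) for x in (xmin, xmax) for y in (ymin, ymax)]
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical glyph: a 2048-unit em, 16 pt text, Tm translating to (50, 600).
print(glyph_bbox_in_device_space(
    (128, -256, 1280, 1536), 2048, 16.0,
    (1, 0, 0, 1, 50, 600), (1, 0, 0, 1, 0, 0)))
# (51.0, 598.0, 60.0, 612.0)
```

Transforming all four corners rather than just two matters as soon as the text matrix or CTM rotates or skews the text. The result is still the unhinted box, so it can differ slightly from what a renderer actually paints.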

There are broadly three categories of font in PDF:

Fonts with PostScript outlines
Fonts with TrueType outlines
User-defined (Type 3) fonts

You can use FreeType to render glyphs from fonts with PostScript and TrueType outlines (you can also have it return the path if you would rather use that).

User-defined (Type 3) fonts have to be treated as a series of PDF operations, scaled by the text matrix, so you need to handle those yourself.

Note that fonts can be organised in two ways, as regular fonts and as CIDFonts, and the means of getting the glyph data corresponding to a character code differs between the two. I assume you're already prepared to deal with that in your existing code.

It's possible that in your case you have a workflow which limits the kinds of fonts you might see, so you may not need a full implementation of all this. For example, I see that you are using CIDFonts with TrueType outlines, but the CIDToGIDMap is /Identity, which reduces the scope of the problem.

For additional complexity, you will need to consider what represents the 'bounding box' of your glyph. Do you consider the advance width and left side-bearing to be part of the bounding box, or just the areas that are actually marked?

Remember that PDF can specify widths for glyphs that differ from those defined in the font, and both your fonts include /W arrays which modify the widths defined in the font.

If you consider the left side-bearing and advance width as part of the glyph, but have a /Widths array with a value smaller than the advance width, it may be that two glyphs will appear to 'collide' but actually still have white space between them. All the /Widths entry has done is reduce the white space from the advance width, so that the glyphs are closer together than would normally be the case.
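To make that concrete, here is a small sketch with invented numbers (none of them come from your fonts). It compares horizontal boxes built from the font's own advance width with boxes built from the marked extents only, when the PDF width is smaller than the font's advance. The helper names are hypothetical.

```python
def advance_box(origin_x, advance, font_size, units_per_em=1000):
    """Horizontal extent from the glyph origin to origin + advance width."""
    return (origin_x, origin_x + advance * font_size / units_per_em)


def ink_box(origin_x, x_min, x_max, font_size, units_per_em=1000):
    """Horizontal extent of the marked area only (outline xMin..xMax)."""
    scale = font_size / units_per_em
    return (origin_x + x_min * scale, origin_x + x_max * scale)


def overlaps(a, b):
    """True if the two 1-D intervals share more than a single point."""
    return a[0] < b[1] and b[0] < a[1]


# Invented values: the font says the advance is 1500 units, the PDF width
# array says 1200, and the outline spans 50..800 units horizontally.
FONT_SIZE = 20
FONT_ADVANCE = 1500
PDF_WIDTH = 1200
X_MIN, X_MAX = 50, 800

first_origin = 100.0
# The second glyph is positioned using the PDF width, not the font's advance.
second_origin = first_origin + PDF_WIDTH * FONT_SIZE / 1000

adv_collide = overlaps(advance_box(first_origin, FONT_ADVANCE, FONT_SIZE),
                       advance_box(second_origin, FONT_ADVANCE, FONT_SIZE))
ink_collide = overlaps(ink_box(first_origin, X_MIN, X_MAX, FONT_SIZE),
                       ink_box(second_origin, X_MIN, X_MAX, FONT_SIZE))
print(adv_collide, ink_collide)  # True False
```

With these numbers the advance-based boxes overlap by 6 units while the marked areas are still 9 units apart, which is the kind of apparent collision described above.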

I had a quick bash at this using MuPDF which gave the answers:

<span bbox="39.21884 163.68216 42.53509 163.99687" font="PlantinMTStd-Regular" size="11.935925">
<char bbox="39.21884 163.68216 42.53509 163.99687" x="39.21884" y="163.99687" c=" "/>

<span bbox="57.200607 163.69899 73.08967 165.2394" font="OpusStd" size="19.841537">
<char bbox="57.200607 163.69899 73.08967 165.2394" x="57.200607" y="165.2394" c="&amp;"/>

<char bbox="82.003044 151.29828 90.63545 152.83868" x="82.003044" y="152.83868" c="&#x2d9;"/>

<char bbox="144.7842 161.21884 153.1744 162.75925" x="144.7842" y="162.75925" c="&#x153;"/>

<char bbox="181.67781 166.17912 190.06801 167.71953" x="181.67781" y="167.71953" c="&#x153;"/>

<char bbox="226.16718 173.61955 236.8826 175.15996" x="226.16718" y="175.15996" c="w"/>

<char bbox="320.10638 153.77843 330.8218 155.31883" x="320.10638" y="155.31883" c="w"/>

<char bbox="414.0456 143.85785 422.4358 145.39825" x="414.0456" y="145.39825" c="&#x153;"/>

<char bbox="450.9392 148.81815 459.3294 150.35855" x="450.9392" y="150.35855" c="&#x153;"/>

<char bbox="487.98785 153.77843 496.37805 155.31883" x="487.98785" y="155.31883" c="&#x153;"/>

<char bbox="524.8815 156.25856 533.27167 157.79897" x="524.8815" y="157.79897" c="&#x153;"/>

And for completeness, here's the same information from Ghostscript using the txtwrite device with -dTextFormat=0:

<page>
<span bbox="39 164 43 164" font="PlantinMTStd-Regular" size="11.9357">
<char bbox="39 164 39 164" c=" "/>
</span>
<span bbox="57 165 73 165" font="OpusStd" size="19.8411">
<char bbox="57 165 57 165" c="&amp;"/>
</span>
<span bbox="82 153 91 153" font="OpusStd" size="19.8411">
<char bbox="82 153 82 153" c="&#x2d9;"/>
</span>
<span bbox="145 163 153 163" font="OpusStd" size="19.8411">
<char bbox="145 163 145 163" c="&#x153;"/>
</span>
<span bbox="182 168 190 168" font="OpusStd" size="19.8411">
<char bbox="182 168 182 168" c="&#x153;"/>
</span>
<span bbox="226 175 237 175" font="OpusStd" size="19.8411">
<char bbox="226 175 226 175" c="w"/>
</span>
<span bbox="320 155 331 155" font="OpusStd" size="19.8411">
<char bbox="320 155 320 155" c="w"/>
</span>
<span bbox="414 145 422 145" font="OpusStd" size="19.8411">
<char bbox="414 145 414 145" c="&#x153;"/>
</span>
<span bbox="451 150 459 150" font="OpusStd" size="19.8411">
<char bbox="451 150 451 150" c="&#x153;"/>
</span>
<span bbox="488 155 496 155" font="OpusStd" size="19.8411">
<char bbox="488 155 488 155" c="&#x153;"/>
</span>
<span bbox="525 158 533 158" font="OpusStd" size="19.8411">
<char bbox="525 158 525 158" c="&#x153;"/>
</span>
</page>

It does look like there's a bug there, though: the urx value is incorrect in the char bbox, but correct in the span bbox.
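If you want to consume that output programmatically, something along these lines should work as a workaround. It is only a sketch using Python's standard xml.etree.ElementTree, and it assumes the text is well-formed XML with a single root element, shaped like the fragment above. When a span holds exactly one char and that char's bbox has collapsed to zero width, it uses the span's bbox instead. The function names are my own.

```python
import xml.etree.ElementTree as ET

# Two spans copied from the txtwrite output above.
SAMPLE = """<page>
<span bbox="57 165 73 165" font="OpusStd" size="19.8411">
<char bbox="57 165 57 165" c="&amp;"/>
</span>
<span bbox="82 153 91 153" font="OpusStd" size="19.8411">
<char bbox="82 153 82 153" c="&#x2d9;"/>
</span>
</page>"""


def parse_bbox(text):
    """Turn a 'llx lly urx ury' attribute string into a tuple of floats."""
    return tuple(float(v) for v in text.split())


def char_boxes(xml_text):
    """Return [(character, bbox), ...] for every <char> in the document."""
    boxes = []
    for span in ET.fromstring(xml_text).iter("span"):
        span_box = parse_bbox(span.get("bbox"))
        chars = span.findall("char")
        for char in chars:
            box = parse_bbox(char.get("bbox"))
            # Workaround for the collapsed urx: only safe when the span
            # contains a single char, as it does in this file.
            if len(chars) == 1 and box[0] == box[2]:
                box = span_box
            boxes.append((char.get("c"), box))
    return boxes


for character, box in char_boxes(SAMPLE):
    print(repr(character), box)
```

Note that these boxes have zero height and whole-number coordinates, so they only give you the horizontal extent along the baseline, not the extent of the marked area.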
