I'm trying to calculate the exact bounding box of every text glyph in a vector PDF.
This involves tracking the CTM and the text-positioning/drawing operators, and also calculating the bounds of each individual glyph in "glyph space" (using the outline data from the glyf tables in the embedded fonts).
I realize the PDF FontDescriptor includes a rough bounding box for each embedded font, but that's a composite of all glyphs in the font -- i.e., the smallest bounding box that fits all glyphs in the font. For my purposes, I need more precise positioning.
My specific application is extracting the musical semantics from a vector PDF of sheet music. As such, one nice constraint is that I can assume glyphs aren't drawn together in the same Tj/TJ operator. Each glyph is drawn independently.
Also, note that I'm defining bounding box as "the smallest box that can contain all the drawn parts of the glyph." There's no need to ignore the ascenders/descenders/etc. that might be considered "outside" the bounding box in other applications.
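To make the moving parts concrete, here is a minimal sketch of the transform chain described above. It assumes an embedded TrueType font, a glyph box given in font units (for example the xMin/yMin/xMax/yMax values from the glyph's header in the glyf table), and the font's unitsPerEm from the head table. The function and parameter names are illustrative, not taken from any library. The text rendering matrix follows the PDF convention of row vectors: text-state parameters, then the text matrix, then the CTM.

```python
from typing import Tuple

# PDF matrix in operator order: a b c d e f
Matrix = Tuple[float, float, float, float, float, float]
# Assumed box layout: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def multiply(m: Matrix, n: Matrix) -> Matrix:
    """Return m x n in PDF's row-vector convention (m is applied first)."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (
        a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2, e1 * b2 + f1 * d2 + f2,
    )


def apply(m: Matrix, x: float, y: float) -> Tuple[float, float]:
    """Transform the point (x, y) by m, i.e. [x y 1] x m."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


def glyph_bbox_to_device(
    glyph_box: Box,            # glyph bounds in font units (assumption)
    units_per_em: float,       # from the font's head table (assumption)
    font_size: float,          # Tfs, set by the Tf operator
    text_matrix: Matrix,       # Tm at the moment the glyph is shown
    ctm: Matrix,               # current transformation matrix
    horizontal_scaling: float = 1.0,  # Tz / 100
    rise: float = 0.0,                # Ts
) -> Box:
    # Text-state parameters, then Tm, then CTM.
    params: Matrix = (font_size * horizontal_scaling, 0.0, 0.0, font_size, 0.0, rise)
    trm = multiply(multiply(params, text_matrix), ctm)

    x0, y0, x1, y1 = glyph_box
    # Font units -> text space, then through the rendering matrix.
    corners = [
        apply(trm, x / units_per_em, y / units_per_em)
        for x, y in ((x0, y0), (x1, y0), (x0, y1), (x1, y1))
    ]
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    return (min(xs), min(ys), max(xs), max(ys))


# Made-up numbers, purely to exercise the function.
box = glyph_bbox_to_device(
    glyph_box=(0, -200, 500, 800),
    units_per_em=1000,
    font_size=10,
    text_matrix=(1, 0, 0, 1, 100, 200),
    ctm=(1, 0, 0, 1, 0, 0),
)
print(box)
```

Transforming the four corners gives the tightest axis-aligned box only when the combined matrix has no rotation or skew; otherwise the result is a conservative bound, and the outline points themselves would need to be transformed.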
There are many moving parts here, and I've found it's quite hard to debug. So here's what I'd love help with:
- This example PDF I've created has 10 glyphs. What are the "ground truth" bounding boxes for these 10 glyphs, in device space? My current code produces the following, but it's incorrect. I know it's incorrect because it says the first glyph ("&") horizontally overlaps the second ("\u02d9"), which you can see isn't true when you view the PDF in a PDF reader.
'&' ( 57.2799755477664, 600.7092061684704, 86.7452642315424, 677.1570718099680)
'\u02d9' ( 82.0030393188000, 633.6851606704608, 96.3090818379936, 644.6969866323168)
'\u0153' (144.7841941848000, 623.9630080194528, 158.6735558539200, 634.5581702962656)
'\u0153' (181.6778111184000, 619.0027260546528, 195.5671727875200, 629.5978883314656)
'w' (226.1671727148000, 611.3638918288608, 245.0765465300448, 622.3161944071392)
'w' (320.1063822180000, 631.2050196880608, 339.0157560332448, 642.1573222663392)
'\u0153' (414.0455917212000, 641.3239948962528, 427.9349533903200, 651.9191571730656)
'\u0153' (450.9392086548000, 636.3637129314528, 464.8285703239200, 646.9588752082656)
'\u0153' (487.9878407856000, 631.4034309666528, 501.8772024547200, 641.9985932434656)
'\u0153' (524.8814577192000, 628.9232899842528, 538.7708193883200, 639.5184522610656)
- How did you calculate those positions? (I realize this is a lot to ask, given the complexity of PDF.) It would be a huge help to have a walkthrough, and I'm sure it would help others in the future.
- Is there a tool that does this off the shelf?
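The horizontal overlap mentioned in the first bullet can be checked mechanically, which is handy as a regression test while debugging. This is a small sketch that assumes each tuple in the listing is laid out as (x_min, y_min, x_max, y_max); the helper name x_overlap is made up for illustration.

```python
from typing import Tuple

# Assumed layout of each tuple in the listing: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def x_overlap(a: Box, b: Box) -> float:
    """Width of the horizontal overlap between two boxes (0.0 if disjoint)."""
    return max(0.0, min(a[2], b[2]) - max(a[0], b[0]))


# First three boxes copied from the listing above.
amp = (57.2799755477664, 600.7092061684704, 86.7452642315424, 677.1570718099680)
dot = (82.0030393188000, 633.6851606704608, 96.3090818379936, 644.6969866323168)
oe = (144.7841941848000, 623.9630080194528, 158.6735558539200, 634.5581702962656)

print(x_overlap(amp, dot))  # positive (roughly 4.74), so the boxes overlap in x
print(x_overlap(dot, oe))   # 0.0, so these two are disjoint in x
```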