Glyph to unicode string translation

Question

Given a glyph index for a specific font, I need to get the Unicode translation of the glyph. To build a glyph-to-Unicode translation, I call GetGlyphIndices over the whole Unicode range and build the reverse mapping (glyph to Unicode character) from the result. However, this only maps a single glyph to a single Unicode character, and I can see that in Hindi, for example, two Unicode characters can be represented by one glyph.

For example, the word namaste (नमस्ते) consists of 6 Unicode characters that are represented by 5 glyphs (the middle two Unicode characters are represented by one glyph). I can see this by attaching a debugger to notepad.exe, setting a breakpoint in ExtTextOut, and printing this word from Notepad.

Is there any way I can translate a glyph to a Unicode string (in case the glyph represents more than one Unicode character)?

I have posted an answer here, but I'm curious why you think you need to do this? You have the input Unicodes already; why do you need to map back to them from glyphs? — djangodude
Thanks for the answer! I'll check out the resources. Actually, I don't have the input Unicodes, only the glyphs (I'm hooking ExtTextOut, and from the hooked function I want to go back from the given glyph to the Unicode characters). — user2975779
I'm having a hard time understanding a situation where you'd only have access to the output glyphs. Surely there is somewhere in your process where there is an input string (Unicode) i.e. in your example above, hook earlier in the process, obtaining the input string (lpString, cbCount) just before ExtTextOut is called? Maybe you could explain in more detail the whole process and where your code fits in? — djangodude
This is true, and I also tried this approach, but I could not always work out where the input string is translated into glyphs. According to the ExtTextOut documentation, glyphs are obtained by calling GetCharacterPlacement. However, when I debug applications (e.g. IE, Chrome) I see that they don't call this function, and I'm not sure how exactly they obtain the glyphs from the Unicode string. — user2975779

djangodude · Accepted Answer · 2014-06-19T18:50:36

1) For all but very simple cases, you should use Uniscribe functions (not GetGlyphIndices) for converting a string (sequence of Unicodes) into glyphs. This is noted in the documentation for GetGlyphIndices: http://msdn.microsoft.com/en-us/library/windows/desktop/dd144890(v=vs.85).aspx

2) There is no way to do what you want reliably in all cases, or even in most cases. This is the result of something known as complex script shaping, which translates a sequence of input Unicodes into a sequence of output glyphs. This is done using a number of tables in the font data. The two most relevant here are the cmap and the GSUB.

The cmap maps Unicode values to font-specific glyphs. The cmap may map multiple Unicodes to a single glyph (multi-mapping). This is a commonly used scheme in many fonts. Also, many glyphs in the font may not be mapped in the cmap at all. Thus, from the cmap alone, you cannot reliably reverse-map a glyph to a single Unicode.
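To make the cmap problem concrete, here is a minimal Python sketch. It does not read a real font: TOY_CMAP, ALL_GLYPHS, the glyph indices and the helper reverse_cmap are all made up for illustration. It only shows what happens when you invert a many-to-one mapping, which is roughly what a reverse table built from GetGlyphIndices results ends up being.

```python
from collections import defaultdict

# Hypothetical cmap of an imaginary font: code point -> glyph index.
# All glyph indices here are invented for illustration.
TOY_CMAP = {
    0x0020: 3,   # SPACE
    0x00A0: 3,   # NO-BREAK SPACE reuses the space glyph
    0x002D: 16,  # HYPHEN-MINUS
    0x2010: 16,  # HYPHEN reuses the hyphen glyph
    0x0041: 36,  # LATIN CAPITAL LETTER A
}

# Every glyph in the imaginary font. Glyph 250 stands for a glyph
# that is only reachable through substitution, not through the cmap.
ALL_GLYPHS = {3, 16, 36, 250}


def reverse_cmap(cmap):
    """Invert code point -> glyph into glyph -> sorted list of code points."""
    rev = defaultdict(list)
    for code_point, glyph in cmap.items():
        rev[glyph].append(code_point)
    return {glyph: sorted(cps) for glyph, cps in rev.items()}


reverse = reverse_cmap(TOY_CMAP)

# Glyphs with more than one candidate code point cannot be reversed uniquely.
ambiguous = {g: cps for g, cps in reverse.items() if len(cps) > 1}

# Glyphs with no cmap entry cannot be reversed at all from the cmap.
unmapped = sorted(ALL_GLYPHS - set(reverse))

for glyph, cps in sorted(ambiguous.items()):
    print(glyph, [f"U+{cp:04X}" for cp in cps])
print("unmapped glyphs:", unmapped)
```

In this toy font, glyphs 3 and 16 each have two possible sources, and glyph 250 has none, so a glyph-to-single-character table has to either guess or give up for three of the four glyphs.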

But it gets even more difficult: the GSUB may specify numerous rules and may convert one input glyph to many output glyphs, or a series of input glyphs into one output glyph. It can even specify contexts under which the conversion will occur (for example, it could say something like "convert 'A' to 'B' but only when the 'A' is preceded by a 'C'", so CA -> CB but DA -> DA). In some cases, specifically with Hindi and other Indic languages, the output glyph sequence may even be in a different order than the logical Unicode input sequence. The net result is that the output sequence of glyphs may map back to a single Unicode, or multiple Unicodes, or none at all. It may be possible to decode the rules of the GSUB + the logic of the script-shaping engine to narrow things down a bit (an adventure not suitable for the weak of spirit!), but the problem is still that multiple input Unicodes could end up resolving to the same output glyph.
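The substitution problem can be sketched the same way. The following Python toy is not a real shaping engine and does not parse a GSUB table; the tables TOY_CMAP, LIGATURES and CONTEXTUAL and the function shape are hypothetical stand-ins. It assumes just two passes, a ligature substitution and the "A becomes B after C" contextual rule from the example above, which is enough to show distinct inputs collapsing into identical glyph output.

```python
# Hypothetical cmap: character -> glyph index (indices are invented).
TOY_CMAP = {"C": 1, "D": 2, "A": 3, "B": 4, "f": 5, "i": 6, "\ufb01": 7}

# Hypothetical ligature rule: glyphs for "f" + "i" -> one "fi" glyph.
LIGATURES = {(5, 6): 7}

# Hypothetical contextual rule: the "A" glyph (3) becomes the "B" glyph (4)
# only when it is preceded by the "C" glyph (1).
CONTEXTUAL = {(1, 3): 4}


def shape(text):
    """Map text to glyphs, then apply the ligature and contextual passes."""
    glyphs = [TOY_CMAP[ch] for ch in text]

    # Pass 1: many-to-one ligature substitution.
    merged = []
    i = 0
    while i < len(glyphs):
        pair = tuple(glyphs[i:i + 2])
        if pair in LIGATURES:
            merged.append(LIGATURES[pair])
            i += 2
        else:
            merged.append(glyphs[i])
            i += 1

    # Pass 2: contextual single substitution, matched against pass 1 output.
    result = list(merged)
    for j in range(1, len(merged)):
        key = (merged[j - 1], merged[j])
        if key in CONTEXTUAL:
            result[j] = CONTEXTUAL[key]
    return result


print(shape("CA"), shape("CB"))      # two different inputs
print(shape("DA"))                   # rule does not fire without the "C"
print(shape("fi"), shape("\ufb01"))  # two characters vs. one precomposed character
```

Here "CA" and "CB" produce the same glyph sequence, as do "fi" and the single character U+FB01, so someone who only sees the glyphs (for example inside a hooked ExtTextOut) cannot tell which string was typed. A real font adds many more rules, plus reordering for Indic scripts, on top of this.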

Bottom line: it's best to view the process of converting a string -> font-specific glyphs as a one-way trip.

For a better understanding of these concepts, I strongly recommend that you read up on complex script shaping as implemented in Windows: http://www.microsoft.com/typography/otspec/TTOCHAP1.htm . As for coding in an application, the Uniscribe reference is also very informative: http://msdn.microsoft.com/en-us/library/windows/desktop/dd374091(v=vs.85).aspx
