How to map code points to Unicode characters depending on the font used?

Question

The client prints labels and has been using a set of symbolic (?) fonts to do this. The application uses a single byte database (Oracle with Latin-1). The old application I am replacing was not Unicode aware. It somehow did OK. The replacement application I am writing is supposed to handle the old data.

The symbols picked from the charmap application often map to particular Unicode characters, but sometimes they don't. What looks like the Moon using the LAB3 font, for example, is in fact U+2014 (EM DASH). When users paste this character into a Swing text field, the character has the code point 8212. ~~It was "moved" into the Private Use Area (by Windows? Java?).~~ When saving this character to the database, Oracle decides that it cannot be safely encoded and replaces it with the dreaded ¿. Thus, I started shifting the characters by 8000: -= 8000 when saving, += 8000 when displaying the field. Unfortunately I discovered that other characters were not shifted by the same amount. In one particular font, for example, ž has the code point 382, so I shifted it by +/-256 to "fix" it.

By now I'm dreading the discovery of more strange offsets and I wonder: Can I get at this mapping using Java? Perhaps the TTF font has a list of the 255 glyphs it encodes and what Unicode characters those correspond to and I can do it "right"?

Right now I'm using the following kludge:

static String fromDatabase(String str, String fontFamily) {

  if (str != null && fontFamily != null) {
    Font font = new Font(fontFamily, Font.PLAIN, 1);

    boolean changed = false;
    char[] chars = str.toCharArray();
    for (int i = 0; i < chars.length; i++) {
      if (font.canDisplay(chars[i] + 0xF000)) {
        // WE8MSWIN1252 + WinXP
        chars[i] += 0xF000;
        changed = true;
      }
      else if (chars[i] >= 128 && font.canDisplay(chars[i] + 8000)) {
        // WE8ISO8859P1 + WinXP
        chars[i] += 8000;
        changed = true;
      }
      else if (font.canDisplay(chars[i] + 256)) {
        // ž in LAB1 Eastern = 382
        chars[i] += 256;
        changed = true;
      }
    }
    if (changed) str = new String(chars);
  }
  return str;
}

static String toDatabase(String str, String fontFamily) {

  if (str != null && fontFamily != null) {
    boolean changed = false;
    char[] chars = str.toCharArray();
    for (int i = 0; i < chars.length; i++) {
      if (chars[i] > 0xF000) {
        // WE8MSWIN1252 + WinXP
        chars[i] -= 0xF000;
        changed = true;
      }
      else if (chars[i] > 8000) {
        // WE8ISO8859P1 + WinXP
        chars[i] = (char) (chars[i] - 8000);
        changed = true;
      }
      else if (chars[i] > 256) {
        // ž in LAB1 Eastern = 382
        chars[i] = (char) (chars[i] - 256);
        changed = true;
      }
    }
    if (changed) return new String(chars);
  }

  return str;
}

Which exact font are you using? Is it a Windows default or otherwise commonly available? U+2014 is 8212 because 2014 is hexadecimal; the code point wasn't moved. — Mark Ransom
These fonts appear to be custom designed for the customer and have names like "LAB1 Western", "LAB2 Cyrillic" and "LAB3 Baltish" etc. — Alex Schröder
I'll edit the question and remove the part about "moving" -- I conflated two issues (back when my database was using WE8MSWIN1252 instead of WE8ISO8859P1 I had characters that were in the 0xF000 range, remnants of which you can still see in the code). — Alex Schröder
I have two questions: 1) Is the database character set Unicode or still some 1-byte flavour? 2) Have you tried to set swing/java to the exact same character set as the DB? — Vincent Malgrat
The database now uses WE8ISO8859P1 (Latin 1). I haven't set Swing/Java to the same character set. How would I do it? I thought Java uses Unicode with UTF-16 encoding internally? — Alex Schröder

Mark Ransom · Accepted Answer · 2012-10-09T16:20:05

The font file certainly has a mapping from Unicode code points to glyphs. Unfortunately, the glyph is completely arbitrary and needn't have any relationship to the character it's mapped to, as you found with the Moon/em dash. The mapping from your single-byte character to a Unicode code point can probably be found in Windows code page 1252, but that's not what you want: you want character 0x97 to equate to a moon glyph such as ☽ FIRST QUARTER MOON U+263D rather than — EM DASH U+2014. I can't suggest anything other than going through each character in the font and comparing it to the available Unicode characters.
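
As a rough illustration of the manual approach, the sketch below keeps a hand-built table per font that overrides the plain code page 1252 decoding. The class name `SymbolFontMapper`, the `TABLES` map, the font key "LAB3", and both method names are assumptions made for this example; only the 0x97 entry comes from the discussion above, and every other entry would have to be filled in by inspecting the glyphs. The `mappedCodePoints` helper uses `Font.canDisplay(int)` to list which code points a font claims to cover, which may help decide what to inspect. Note that if the family is not installed, `new Font(...)` silently substitutes a default font, so the result should be treated with care.

```java
import java.awt.Font;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SymbolFontMapper {

    // Windows code page 1252, the single-byte encoding mentioned in the answer.
    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Hypothetical hand-built tables: font family -> (legacy byte -> intended code point).
    private static final Map<String, Map<Integer, Integer>> TABLES = new HashMap<>();
    static {
        Map<Integer, Integer> lab3 = new HashMap<>();
        lab3.put(0x97, 0x263D); // drawn as a moon in LAB3: FIRST QUARTER MOON
        TABLES.put("LAB3", lab3); // "LAB3" is an assumed key, not a verified family name
    }

    /** Decodes legacy bytes, preferring the per-font table over plain CP1252. */
    public static String toSemantic(byte[] raw, String fontFamily) {
        Map<Integer, Integer> table = TABLES.getOrDefault(fontFamily, new HashMap<>());
        String fallback = new String(raw, CP1252); // single-byte charset: one char per byte
        StringBuilder out = new StringBuilder(raw.length);
        for (int i = 0; i < raw.length; i++) {
            int legacy = raw[i] & 0xFF; // bytes are signed in Java
            Integer mapped = table.get(legacy);
            if (mapped != null) {
                out.appendCodePoint(mapped);
            } else {
                out.append(fallback.charAt(i));
            }
        }
        return out.toString();
    }

    /** Lists the BMP code points (from U+0020 up) that the font reports a glyph for. */
    public static List<Integer> mappedCodePoints(Font font) {
        List<Integer> result = new ArrayList<>();
        for (int cp = 0x20; cp <= 0xFFFF; cp++) {
            if (font.canDisplay(cp)) {
                result.add(cp);
            }
        }
        return result;
    }
}
```

With this table, `toSemantic(new byte[] {(byte) 0x97}, "LAB3")` yields the moon character, while the same byte under a font with no table falls back to the code page 1252 meaning, EM DASH. The reverse direction for saving to the database would need the inverse table, which is not shown here.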
