2
votes

I am working on a program in Java that only deals with capital letters. During some processing, I use the int values of the chars for these capital letters. I understand that the values of the capital letters are the same in Unicode and ASCII, but when referring to these int values, should I call them Unicode values or ASCII values? I just want to make sure that I'm using the correct terminology for the language.


3 Answers

2
votes

It should be referred to as a Unicode code unit. A Java char is a 16-bit UTF-16 code unit, as opposed to a full Unicode code point, which can range up to U+10FFFF and is held in a 32-bit int in Java (it was originally thought that Unicode would fit in 16 bits). A char will always take 16 bits, regardless of what the value is. ASCII is 7-bit (8 if you count the zero padding or error-checking bit). Thus, the term ASCII doesn't fully apply even if the actual value is the same.
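To illustrate the code unit versus code point distinction, here is a minimal sketch. The class and helper names (`CodeUnitDemo`, `codeUnits`, `codePoints`) are made up for this example; the library calls are standard `String` and `Character` methods.

```java
public class CodeUnitDemo {
    // Number of 16-bit char code units in the string (hypothetical helper).
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string (hypothetical helper).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        char a = 'A';
        System.out.println((int) a);         // 65, same number as in ASCII
        System.out.println(Character.SIZE);  // 16 bits per char

        // U+1D445 is outside the 16-bit range, so it needs two chars
        // (a surrogate pair) even though it is a single code point.
        String r = new String(Character.toChars(0x1D445));
        System.out.println(codeUnits(r));    // 2
        System.out.println(codePoints(r));   // 1
    }
}
```

For capital letters A to Z each char is exactly one code point, so the difference only shows up once text outside the 16-bit range appears.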

1
votes

If the characters will only ever be ASCII, you can refer to them as ASCII. Otherwise, you should use the term Unicode which, as you state, is a proper superset of ASCII. Keep in mind that, even though you refer to them as ASCII, the encoding may need to be changed if you're sending them to something that expects real (octet-based) ASCII.
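As a rough sketch of that last point, this is one way to turn Java chars into single-octet ASCII before handing them to something else. The class and method names (`AsciiBytesDemo`, `isAscii`, `toAscii`) are assumptions for illustration only.

```java
import java.nio.charset.StandardCharsets;

public class AsciiBytesDemo {
    // True when every char is in the 7-bit ASCII range (hypothetical helper).
    static boolean isAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    // Encode the text as US-ASCII, one octet per character (hypothetical helper).
    // Callers should check isAscii first; anything unmappable is replaced
    // by the charset's substitution byte rather than raising an error.
    static byte[] toAscii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String text = "ABC";
        byte[] octets = toAscii(text);
        System.out.println(isAscii(text));   // true
        System.out.println(octets.length);   // 3, one octet per letter
        System.out.println(octets[0]);       // 65
    }
}
```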

If your software only handles code points in the ASCII range (and see below, this is not usually a good idea), it's much easier to say (to users, or in the documentation) "ASCII values" than "Unicode values in the ASCII range" :-)

It's actually misleading to refer to your values as Unicode code points in the context of doing things to uppercase letters, if you only handle the uppercase letters in the ASCII range.

Any new software nowadays should be written with Unicode in mind, and that includes the fact that uppercase letters are not restricted to the ASCII range.

For example, there's a chunk of Greek characters nowhere near the ASCII range that have upper and lowercase properties. The SpecialCasing.txt file shows these properties and there's also a FAQ on the subject.
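A small sketch of why an ASCII-only uppercase check falls short. `GreekCaseDemo` and `isAsciiUpper` are hypothetical names; `Character.isUpperCase`, `Character.toLowerCase` and `String.toUpperCase` are the standard library methods.

```java
import java.util.Locale;

public class GreekCaseDemo {
    // Naive check that only recognises A to Z (hypothetical helper).
    static boolean isAsciiUpper(char c) {
        return c >= 'A' && c <= 'Z';
    }

    public static void main(String[] args) {
        char sigma = '\u03A3'; // GREEK CAPITAL LETTER SIGMA

        System.out.println(isAsciiUpper(sigma));            // false
        System.out.println(Character.isUpperCase(sigma));   // true
        System.out.println(Character.toLowerCase(sigma) == '\u03C3'); // true

        // A special-casing example: LATIN SMALL LETTER SHARP S (U+00DF)
        // uppercases to two letters.
        System.out.println("\u00DF".toUpperCase(Locale.ROOT)); // SS
    }
}
```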

0
votes

The correct and proper term, according to the Unicode Glossary, for the numeric code is its code point. For example:

  • The code point for DIGIT ONE is 31₁₆ (49₁₀), normally written U+0031.
  • The code point for POUND SIGN is U+00A3.
  • The code point for LATIN SMALL LETTER I WITH DIAERESIS is U+00EF.
  • The code point for GREEK SMALL LETTER MU is U+03BC.
  • The code point for LATIN SMALL LETTER F WITH DOT ABOVE is U+1E1F.
  • The code point for REPLACEMENT CHARACTER is U+FFFD.
  • The code point for MUSICAL SYMBOL DOUBLE FLAT is U+1D12B.
  • The code point for MATHEMATICAL ITALIC CAPITAL R is U+1D445.
  • The code point for EXTRATERRESTRIAL ALIEN is U+1F47D.
  • U+100002 is an assigned code point in the Supplementary_Private_Use_Area_B block.
  • The assigned name of code point U+0041 is LATIN CAPITAL LETTER A.
  • The assigned name of code point U+1F47E is ALIEN MONSTER.
  • Code point U+0FFE is unassigned, and so has no name.

And so on and so forth.
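In Java, the U+ notation and the assigned names above can be inspected roughly like this. `CodePointNames` and `uPlus` are names invented for this sketch; `Character.getName` exists since Java 7, and what it returns for a given code point depends on the Unicode version your JDK ships with.

```java
public class CodePointNames {
    // Format a code point in the usual U+XXXX notation (hypothetical helper).
    static String uPlus(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        int a = 'A';
        System.out.println(uPlus(a));                // U+0041
        System.out.println(Character.getName(a));    // LATIN CAPITAL LETTER A

        System.out.println(uPlus(0x1D12B));          // U+1D12B

        // For an unassigned code point, getName returns null and
        // isDefined returns false; the result depends on the JDK's Unicode data.
        System.out.println(Character.isDefined(0x0FFE));
        System.out.println(Character.getName(0x0FFE));
    }
}
```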