I am working on a program in Java that only deals with capital letters. During some processing, I am using the int value of the chars for these capital letters. I understand that the values of the capital letters are the same in Unicode and ASCII, but when referring to these int values, should I say that they are the Unicode values or the ASCII values? I just want to make sure that I'm using the correct terminology for the language.
3 Answers
It should be referred to as a Unicode code unit. A Java char is a 16-bit UTF-16 code unit, as opposed to a full Unicode code point, which can need up to 21 bits and which Java represents as a 32-bit int (it was originally thought that Unicode would fit in 16 bits). A char always takes 16 bits, regardless of its value. ASCII is 7-bit (8 if you count the zero padding or error-checking bit). Thus, the term doesn't fully apply even if the actual value is the same.
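To make the code unit versus code point distinction concrete, here is a minimal sketch. The class name CodeUnitDemo and its two helper methods are made up for illustration; the library calls (Character.SIZE, Character.toChars, String.codePointCount) are standard Java.

```java
final class CodeUnitDemo {
    // Number of UTF-16 code units (Java chars) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        char a = 'A';
        int value = a;                        // widening conversion from char to int
        System.out.println(value);            // prints 65
        System.out.println(Character.SIZE);   // prints 16 (bits per char)

        // U+1F47D lies outside the 16-bit range, so it needs two chars (a surrogate pair).
        String alien = new String(Character.toChars(0x1F47D));
        System.out.println(codeUnits(alien));  // prints 2
        System.out.println(codePoints(alien)); // prints 1
    }
}
```

For capital letters A to Z the two counts always agree, which is why the distinction is easy to overlook.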
If the characters will only ever be ASCII, you can refer to them as ASCII. Otherwise, you should use the term Unicode which, as you state, is a proper superset of ASCII. Keep in mind that, even though you refer to them as ASCII, the encoding may need to be changed if you're sending them to something that expects real (octet-based) ASCII.
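As a rough illustration of that last point, the sketch below converts a Java string to octet-based ASCII before handing it to something else. The class AsciiBytesDemo and its methods are hypothetical names; rejecting non-ASCII input up front is one possible design choice, not the only one.

```java
import java.nio.charset.StandardCharsets;

final class AsciiBytesDemo {
    // True if every char in the string is in the 7-bit ASCII range.
    static boolean isAscii(String s) {
        return s.chars().allMatch(c -> c < 0x80);
    }

    // Encodes the string as one byte per character, rejecting non-ASCII input.
    static byte[] toAsciiBytes(String s) {
        if (!isAscii(s)) {
            throw new IllegalArgumentException("not pure ASCII: " + s);
        }
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] bytes = toAsciiBytes("ABC");
        System.out.println(bytes.length); // prints 3 (one octet per letter)
        System.out.println(bytes[0]);     // prints 65
    }
}
```

The explicit check matters because getBytes with US_ASCII silently substitutes a replacement byte for characters it cannot encode rather than failing.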
If your software only handles code points in the ASCII range (and see below, this is not usually a good idea), it's much easier to say (to users, or in the documentation) "ASCII values" than "Unicode values in the ASCII range" :-)
It's actually misleading to refer to your values as Unicode code points in the context of doing things to uppercase letters, if you only handle the uppercase letters in the ASCII range.
Any new software nowadays should be written with Unicode in mind, and that includes the fact that uppercase letters are not restricted to the ASCII range.
For example, there's a chunk of Greek characters nowhere near the ASCII range that have upper and lowercase properties. The SpecialCasing.txt file shows these properties and there's also a FAQ on the subject.
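A small sketch of the difference, assuming a hypothetical ASCII-only check named isAsciiUpper alongside the standard Character and String methods. GREEK CAPITAL LETTER SIGMA (U+03A3) is used as the example letter, and LATIN SMALL LETTER SHARP S (U+00DF) shows one of the special-casing rules.

```java
import java.util.Locale;

final class UppercaseDemo {
    // Hypothetical ASCII-only test: true only for 'A' through 'Z'.
    static boolean isAsciiUpper(int codePoint) {
        return codePoint >= 'A' && codePoint <= 'Z';
    }

    // Locale-independent uppercasing of a whole string.
    static String upper(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        int sigma = 0x03A3; // GREEK CAPITAL LETTER SIGMA

        System.out.println(isAsciiUpper(sigma));          // prints false
        System.out.println(Character.isUpperCase(sigma)); // prints true
        System.out.println(Character.toLowerCase(sigma) == 0x03C3); // prints true

        // Special casing: one code point can uppercase to two.
        System.out.println(upper("\u00DF")); // prints SS
    }
}
```

An ASCII-only range check is fine when the input really is restricted to A to Z, but it gives the wrong answer as soon as other scripts appear.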
The correct and proper term according to the Unicode Glossary for the numeric code is its code point. For example:
- The code point for DIGIT ONE is 31 hexadecimal (49 decimal), normally written U+0031.
- The code point for POUND SIGN is U+00A3.
- The code point for LATIN SMALL LETTER I WITH DIAERESIS is U+00EF.
- The code point for GREEK SMALL LETTER MU is U+03BC.
- The code point for LATIN SMALL LETTER F WITH DOT ABOVE is U+1E1F.
- The code point for REPLACEMENT CHARACTER is U+FFFD.
- The code point for MUSICAL SYMBOL DOUBLE FLAT is U+1D12B.
- The code point for MATHEMATICAL ITALIC CAPITAL R is U+1D445.
- The code point for EXTRATERRESTRIAL ALIEN is U+1F47D.
- U+100002 is an assigned code point in the Supplementary_Private_Use_Area_B block.
- The assigned name of code point U+0041 is LATIN CAPITAL LETTER A.
- The assigned name of code point U+1F47E is ALIEN MONSTER.
- Code point U+0FFE is unassigned, and so has no name.
And so on and so forth.
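The U+XXXX notation and the character names in the list above can be reproduced from Java. This is only a sketch: the class CodePointDemo and its label method are invented for illustration, while Character.getName and String.codePoints are standard library calls (the names returned depend on the Unicode version your JDK ships with).

```java
final class CodePointDemo {
    // Formats a code point in the conventional U+XXXX notation (at least four hex digits).
    static String label(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        System.out.println(label('1'));     // prints U+0031
        System.out.println(label(0x1D12B)); // prints U+1D12B

        // Look up the assigned name of a code point.
        System.out.println(Character.getName(0x0041)); // prints LATIN CAPITAL LETTER A

        // Walk a string by code point rather than by char.
        String text = "A" + new String(Character.toChars(0x1F47E));
        text.codePoints().forEach(cp ->
                System.out.println(label(cp) + " " + Character.getName(cp)));
    }
}
```

Iterating with codePoints() instead of charAt() keeps supplementary characters such as U+1F47E in one piece.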