4
votes

I saved some strings in Microsoft Agenda in Unicode big-endian format (UTF-16BE). I then opened the file with the shell command xxd and wrote down the byte values. Separately, I used ord() to get the code point of each character (ord() is a Python built-in function that takes a one-character Unicode string and returns its code point value). When I compare the two sets of values, they are equal.

But I think that a Unicode code point value is different from its UTF-16BE representation: one is a code point, the other is an encoding format. They are equal for some characters, but maybe they differ for others.

Is the Unicode code point value equal to the UTF-16BE encoding representation for every character?


1 Answer

9
votes

No, codepoints outside of the Basic Multilingual Plane use two UTF-16 words (so 4 bytes).

For codepoints in the U+0000 to U+D7FF and U+E000 to U+FFFF ranges, the codepoint and UTF-16 encoding map one-to-one.
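As a quick check of that one-to-one mapping, here is a minimal sketch, assuming Python 3 (where `str` is Unicode and `int.from_bytes` is available). The sample characters are arbitrary picks from the Basic Multilingual Plane, not anything from your file:

```python
# Sample BMP characters: U+0041, U+00E9, U+4E2D (chosen for illustration only).
samples = ["A", "\u00e9", "\u4e2d"]

for ch in samples:
    encoded = ch.encode("utf-16-be")          # always 2 bytes for a BMP character
    as_number = int.from_bytes(encoded, "big")  # read the 2 bytes as one integer
    # For BMP characters the 16-bit word is numerically the code point.
    print(f"U+{ord(ch):04X} -> {encoded.hex()} equal={as_number == ord(ch)}")
```

Each line should report `equal=True`, which matches what you saw when comparing xxd output with ord().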

For codepoints in the range U+10000 to U+10FFFF, two words in the range 0xD800 to 0xDFFF are used: a lead surrogate from 0xD800 to 0xDBFF followed by a trail surrogate from 0xDC00 to 0xDFFF.
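The arithmetic behind the surrogate pair can be sketched as follows, assuming Python 3. The function name `to_surrogates` is made up for this illustration; it is not a standard library function:

```python
def to_surrogates(code_point):
    """Split a supplementary-plane code point into (lead, trail) surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF use surrogates")
    offset = code_point - 0x10000        # 20-bit value
    lead = 0xD800 + (offset >> 10)       # top 10 bits
    trail = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return lead, trail


lead, trail = to_surrogates(0x1F600)
print(hex(lead), hex(trail))  # 0xd83d 0xde00
```

This agrees with Python's own encoder: `chr(0x1F600).encode("utf-16-be")` yields the bytes `d8 3d de 00`, which clearly differ from the code point value 0x1F600.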

See the UTF-16 Wikipedia article for the nitty-gritty details.

So most UTF-16 big-endian words, when printed in hex, can be mapped directly to Unicode codepoints. For UTF-16 little-endian, you just swap the two bytes of each word first. For UTF-16 words starting with a byte from 0xD8 through 0xDF, you'll have to map the surrogate pair to the actual codepoint.
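Putting that together, here is a rough sketch of mapping raw UTF-16 bytes (such as those shown by xxd, without a byte order mark) back to codepoints by hand, assuming Python 3. The name `utf16_code_points` and its `byteorder` parameter are my own invention for this example; in practice `data.decode("utf-16-be")` does the same job:

```python
def utf16_code_points(data, byteorder="big"):
    """Return the list of code points encoded in raw UTF-16 bytes.

    Hypothetical helper for illustration; pass byteorder="little" for UTF-16LE.
    Unpaired surrogates are passed through unchanged rather than rejected.
    """
    if len(data) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    # Read the bytes as a sequence of 16-bit words.
    units = [int.from_bytes(data[i:i + 2], byteorder)
             for i in range(0, len(data), 2)]
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_lead = 0xD800 <= unit <= 0xDBFF
        has_trail = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_lead and has_trail:
            # Combine the two 10-bit halves and add back the 0x10000 offset.
            result.append(0x10000 + ((unit - 0xD800) << 10)
                          + (units[i + 1] - 0xDC00))
            i += 2
        else:
            result.append(unit)  # BMP word: the value is the code point
            i += 1
    return result


raw = "A\U0001F600".encode("utf-16-be")
print([hex(cp) for cp in utf16_code_points(raw)])  # ['0x41', '0x1f600']
```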