I am learning about UTF-16 encoding, and I have read that to represent code points in the range U+10000 to U+10FFFF you have to use surrogate pairs, whose code units lie in the range U+D800 to U+DFFF.
So let's say I want to encode the following code point: U+10123 (10000000100100011 in binary):
First, I lay out this bit template:
110110xxxxxxxxxx 110111xxxxxxxxxx
Then I fill the x placeholders with the bits of the code point:
1101100001000000 1101110100100011 (D840 DD23 in hexadecimal)
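Taken literally, the two steps above can be reproduced in Python like this (my own sketch; the function name is made up, and note that the comments below this question point out a step this procedure is missing):

```python
def fill_surrogate_template(cp):
    # Put the low 20 bits of the code point straight into the
    # 110110xxxxxxxxxx 110111xxxxxxxxxx template, with no other steps.
    high = 0b1101100000000000 | ((cp >> 10) & 0x3FF)  # top 10 bits
    low = 0b1101110000000000 | (cp & 0x3FF)           # bottom 10 bits
    return f"{high:04X} {low:04X}"

print(fill_surrogate_template(0x10123))  # -> D840 DD23, as computed above
```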
I have also read that the code points in the range U+D800 to U+DFFF were removed from the Unicode character set, but I don't understand why this range was removed!
I mean, this range could easily be encoded in 4 bytes; for example, the following is the UTF-16 encoding of the code point U+D812 (1101100000010010 in binary):
1101100000110110 1101110000010010 (D836 DC12 in hexadecimal)
Note: I was using UTF-16 Big Endian in my examples.
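For what it's worth, real encoders reject exactly this case: Python's UTF-16 codec, for instance, refuses to encode a lone surrogate such as U+D812 (a quick check of my own, not part of the original question):

```python
# Attempting to UTF-16-encode a lone surrogate raises UnicodeEncodeError
# in Python, because D800-DFFF are reserved code points.
try:
    chr(0xD812).encode("utf-16-be")
except UnicodeEncodeError as e:
    print("rejected:", e.reason)
```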
Comments:

D840 DD23 and not D800 DD23? – Roland Illig

D840 DD23, but when I encode it using this online tool: r12a.github.io/apps/conversion, I get D800 DD23. Is my manual encoding method wrong? – paul

D800 DD23 is the correct answer; what I did wrong is that I forgot to subtract 0x10000 from the code point (this should have been my first step). – paul
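For the record, the full algorithm the last comment arrives at can be sketched in Python (my own illustration; the function name is made up):

```python
def utf16_surrogate_pair(cp):
    """Encode a supplementary code point (U+10000..U+10FFFF) as a
    UTF-16 surrogate pair, returning the (high, low) code units."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000           # the subtraction step the question missed
    high = 0xD800 + (v >> 10)  # top 10 bits go into the high surrogate
    low = 0xDC00 + (v & 0x3FF) # bottom 10 bits go into the low surrogate
    return high, low

h, l = utf16_surrogate_pair(0x10123)
print(f"{h:04X} {l:04X}")  # -> D800 DD23, matching the online tool
```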