In Swift, how is a UTF-16 surrogate pair represented at the bit level?

Question

I'm currently learning Swift using the book The Swift Programming Language (Swift 3.1).

The book states that Swift's String and Character types are fully Unicode-compliant, with each character represented by a 21-bit Unicode scalar value. Each character can be viewed as UTF-8, UTF-16, or UTF-32.

I understand how UTF-8 and UTF-32 work at the byte and bit level, but I'm having trouble understanding how UTF-16 works at the bit level.

I know that for characters whose code point fits into 16 bits, UTF-16 simply represents the character as a 16-bit number. But for characters whose representation requires more than 16 bits, two 16-bit blocks are used (called a surrogate pair, I believe).

But how are the two 16-bit blocks represented at the bit level?

Martin R · Accepted Answer · 2017-03-27T08:47:55

A "Unicode Scalar Value" is

Any Unicode code point except high-surrogate and low-surrogate code points. In other words, the ranges of integers 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆ inclusive.

Every Unicode scalar value can be represented as a sequence of one or two UTF-16 code units, as described in the Unicode Standard:

D91 UTF-16 encoding form

The Unicode encoding form that assigns each Unicode scalar value in the ranges U+0000..U+D7FF and U+E000..U+FFFF to a single unsigned 16-bit code unit with the same numeric value as the Unicode scalar value, and that assigns each Unicode scalar value in the range U+10000..U+10FFFF to a surrogate pair, according to Table 3-5.

Table 3-5. UTF-16 Bit Distribution

Scalar Value              UTF-16
xxxxxxxxxxxxxxxx          xxxxxxxxxxxxxxxx
000uuuuuxxxxxxxxxxxxxxxx  110110wwwwxxxxxx 110111xxxxxxxxxx

Note: wwww = uuuuu - 1
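
To see the bit distribution from Table 3-5 on a concrete value, here is a small sketch that prints the UTF-16 code units of U+1F600 in binary. The helper `bits16` is a hypothetical name introduced only for this illustration; `String.utf16` and `String(_:radix:)` are standard library APIs.

```swift
// U+1F600 lies in the supplementary planes, so it needs a surrogate pair.
let text = "\u{1F600}"

// Hypothetical helper: render a code unit as exactly 16 binary digits.
func bits16(_ value: UInt16) -> String {
    let raw = String(value, radix: 2)
    return String(repeating: "0", count: 16 - raw.count) + raw
}

let units = Array(text.utf16)      // the two 16-bit code units
let rendered = units.map(bits16)
for line in rendered {
    print(line)
}
// 1101100000111101   (0xD83D, high surrogate: 110110 wwww xxxxxx)
// 1101111000000000   (0xDE00, low surrogate:  110111 xxxxxxxxxx)
```

For U+1F600 the scalar bits are uuuuu = 00001, so wwww = 0000, and the remaining 16 x bits are split 6 + 10 across the two code units.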

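There are 2²⁰ Unicode scalar values in the "Supplementary Planes" (U+10000..U+10FFFF), which means that 20 bits are sufficient to encode all of them in a surrogate pair. Technically this is done by subtracting 0x010000 from the value, which leaves a 20-bit number, and splitting that number into two blocks of 10 bits. The upper 10 bits are added to 0xD800 to form the high surrogate (prefix 110110), and the lower 10 bits are added to 0xDC00 to form the low surrogate (prefix 110111).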
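
The subtract-and-split step can be sketched by hand in Swift and compared against what the standard library produces. The function name `utf16Units(for:)` is an assumption made for this example, not a standard API, and the spelling `Unicode.Scalar` assumes Swift 4 or later (in Swift 3.1 the type is spelled `UnicodeScalar`).

```swift
// Hand-rolled UTF-16 encoding of a single Unicode scalar (illustrative only).
func utf16Units(for scalar: Unicode.Scalar) -> [UInt16] {
    let value = scalar.value                 // UInt32 scalar value
    if value < 0x10000 {
        // BMP scalar: one code unit with the same numeric value.
        return [UInt16(value)]
    }
    let offset = value - 0x10000             // fits in 20 bits
    let high = 0xD800 + (offset >> 10)       // upper 10 bits
    let low = 0xDC00 + (offset & 0x3FF)      // lower 10 bits
    return [UInt16(high), UInt16(low)]
}

let scalar: Unicode.Scalar = "\u{1F600}"
let pair = utf16Units(for: scalar)
print(pair.map { String($0, radix: 16, uppercase: true) })  // ["D83D", "DE00"]

// Cross-check against Swift's own UTF-16 view of the same scalar.
let matchesStandardLibrary = pair == Array(String(scalar).utf16)
print(matchesStandardLibrary)                                // true
```

No validity check is needed inside the function, because a `Unicode.Scalar` can never hold a surrogate code point or a value above U+10FFFF.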
