I'm currently learning Swift using the book The Swift Programming Language (Swift 3.1).

The book states that Swift's String and Character types are fully Unicode-compliant, with each character made up of 21-bit Unicode scalar values. Each character can be viewed as UTF-8, UTF-16, or UTF-32.

I understand how UTF-8 and UTF-32 work at the byte and bit level, but I'm having trouble understanding how UTF-16 works at the bit level.

I know that for characters whose code point fits into 16 bits, UTF-16 simply represents the character as a 16-bit number. But for characters whose representation requires more than 16 bits, two 16-bit blocks are used (called a surrogate pair, I believe).

But how are the two 16-bit blocks laid out at the bit level?

2 Answers

1
votes

A "Unicode Scalar Value" is

Any Unicode code point except high-surrogate and low-surrogate code points. In other words, the ranges of integers 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆ inclusive.
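As a small illustration of that definition, the range check can be written out directly. The function name `isUnicodeScalarValue` is made up for this sketch; the comparison at the end relies on the standard library's failable `UnicodeScalar(UInt32)` initializer, which rejects surrogates and out-of-range values.

```swift
// Hypothetical helper (not a standard library API): true when `codePoint`
// lies in 0...0xD7FF or 0xE000...0x10FFFF, i.e. is not a surrogate
// and not above the Unicode range.
func isUnicodeScalarValue(_ codePoint: UInt32) -> Bool {
    return codePoint <= 0xD7FF || (codePoint >= 0xE000 && codePoint <= 0x10FFFF)
}

// The standard library's failable initializer should agree with the check.
let samples: [UInt32] = [0x41, 0xD7FF, 0xD800, 0xDFFF, 0xE000, 0x10FFFF, 0x110000]
for codePoint in samples {
    let accepted = UnicodeScalar(codePoint) != nil
    print(String(codePoint, radix: 16), isUnicodeScalarValue(codePoint), accepted)
}
```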

Every Unicode scalar value can be represented as a sequence of one or two UTF-16 code units, as described in the Unicode Standard:

D91 UTF-16 encoding form

The Unicode encoding form that assigns each Unicode scalar value in the ranges U+0000..U+D7FF and U+E000..U+FFFF to a single unsigned 16-bit code unit with the same numeric value as the Unicode scalar value, and that assigns each Unicode scalar value in the range U+10000..U+10FFFF to a surrogate pair, according to Table 3-5.

Table 3-5. UTF-16 Bit Distribution

Scalar Value              UTF-16
xxxxxxxxxxxxxxxx          xxxxxxxxxxxxxxxx
000uuuuuxxxxxxxxxxxxxxxx  110110wwwwxxxxxx 110111xxxxxxxxxx

Note: wwww = uuuuu - 1

There are 2^20 Unicode scalar values in the "Supplementary Planes" (U+10000..U+10FFFF), which means that 20 bits are sufficient to encode all of them in a surrogate pair. Technically this is done by subtracting 0x010000 from the value before splitting it into two blocks of 10 bits: the high block is added to 0xD800 and the low block to 0xDC00.
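Here is a minimal sketch of that arithmetic in Swift. The function name `utf16CodeUnits(for:)` is an assumption made for illustration, not a standard library API; the last lines compare its result with what the built-in `utf16` view produces.

```swift
// Hypothetical helper: encodes one Unicode scalar value as one or two
// UTF-16 code units, following Table 3-5.
func utf16CodeUnits(for scalar: UnicodeScalar) -> [UInt16] {
    let value = scalar.value
    if value < 0x10000 {
        // BMP scalar: the code unit has the same numeric value.
        return [UInt16(value)]
    }
    // Supplementary scalar: subtract 0x10000 to get a 20-bit offset.
    // This subtraction is what "wwww = uuuuu - 1" expresses in the table.
    let offset = value - 0x10000
    let high = UInt16(0xD800 + (offset >> 10))    // 110110 + top 10 bits
    let low = UInt16(0xDC00 + (offset & 0x3FF))   // 110111 + bottom 10 bits
    return [high, low]
}

let grinningFace = UnicodeScalar(0x1F600 as UInt32)!
let units = utf16CodeUnits(for: grinningFace)
print(units.map { String($0, radix: 16) })            // ["d83d", "de00"]

// Cross-check against the standard library's UTF-16 view.
let builtIn = Array(String(Character(grinningFace)).utf16)
print(units == builtIn)                               // true
```

For U+1F600 the offset is 0xF600; its top 10 bits are 0x3D and its bottom 10 bits are 0x200, giving the pair D83D DE00.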

1
votes

The UTF-16 range D800..DFFF is reserved for surrogates. Values below and above that are simple 16-bit code points. Values D800..DBFF, minus D800, are the high 10 bits of a 20-bit offset for code points beyond FFFF (the offset is the code point minus 10000). The next two bytes hold a value in DC00..DFFF which, minus DC00, gives the low 10 bits. Of course, endianness comes into the picture, making us all wish we could just use UTF-8. Sigh.
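Going the other way, a rough sketch of decoding a pair and of the byte-order issue might look like this. Both function names (`decodeSurrogatePair` and `bytes(of:bigEndian:)`) are made up for illustration and are not standard library APIs.

```swift
// Hypothetical helper: combines a high and a low surrogate into the
// scalar value they encode, or returns nil if they do not form a pair.
func decodeSurrogatePair(high: UInt16, low: UInt16) -> UInt32? {
    guard high >= 0xD800 && high <= 0xDBFF,
          low >= 0xDC00 && low <= 0xDFFF else {
        return nil
    }
    let highBits = UInt32(high - 0xD800)   // top 10 bits of the offset
    let lowBits = UInt32(low - 0xDC00)     // bottom 10 bits of the offset
    return 0x10000 + ((highBits << 10) | lowBits)
}

// Hypothetical helper: serializes one 16-bit code unit in either byte order.
func bytes(of unit: UInt16, bigEndian: Bool) -> [UInt8] {
    let upper = UInt8(unit >> 8)
    let lower = UInt8(unit & 0xFF)
    return bigEndian ? [upper, lower] : [lower, upper]
}

if let value = decodeSurrogatePair(high: 0xD83D, low: 0xDE00) {
    print(String(value, radix: 16))            // 1f600
}
print(bytes(of: 0xD83D, bigEndian: true))      // [216, 61]  (UTF-16BE)
print(bytes(of: 0xD83D, bigEndian: false))     // [61, 216]  (UTF-16LE)
```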