1 vote

According to the Unicode specification:

D91 UTF-16 encoding form: The Unicode encoding form that assigns each Unicode scalar value in the ranges U+0000..U+D7FF and U+E000..U+FFFF to a single unsigned 16-bit code unit with the same numeric value as the Unicode scalar value, and that assigns each Unicode scalar value in the range U+10000..U+10FFFF to a surrogate pair.

The term "scalar value" seems to refer to Unicode code points, that is, the range of abstract entities that must be encoded into specific byte sequences by the various encoding forms (UTF-16 and so on). So the gist of this excerpt appears to be that not all code points fit into a single UTF-16 code unit (two bytes); some must be encoded into a pair of code units, i.e. 4 bytes (called "a surrogate pair").

However, the very term "scalar value" is defined as follows:

D76 Unicode scalar value: Any Unicode code point except high-surrogate and low-surrogate code points.

Wait... does Unicode itself have surrogate code points? Why are they needed when UTF-16 can simply use 4 bytes to represent those code points? Can anyone explain the rationale and how UTF-16 uses these code points?


2 Answers

2
votes

Yes, Unicode reserves ranges for surrogate code points:

Unicode reserves these ranges because those 16-bit values are used in surrogate pairs, and no character can ever be assigned to them. A surrogate pair is a pair of 16-bit code units that together encode a code point above U+FFFF, which does not fit into a single 16-bit value.
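As an illustration, here is a minimal Python sketch of how such a pair is built (the helper name `to_surrogate_pair` is my own, not part of any standard library):

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000                 # 20-bit value, 0..0xFFFFF
    high = 0xD800 + (v >> 10)        # top 10 bits -> high surrogate
    low = 0xDC00 + (v & 0x3FF)       # low 10 bits -> low surrogate
    return high, low

# U+1F600 (an emoji face) becomes the pair D83D DE00:
print([hex(u) for u in to_surrogate_pair(0x1F600)])  # ['0xd83d', '0xde00']

# Cross-check against Python's own UTF-16 (big-endian) encoder:
assert "\U0001F600".encode("utf-16-be").hex() == "d83dde00"
```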

0
votes

Just for the sake of ultimate clarification.

  • UTF-16 uses 16-bit (2-byte) code units. That means this encoding form encodes code points (abstract entities that must be represented in computer memory in some way), as a rule, into 16 bits (so a decoder reads data two bytes at a time).
  • For most code points, UTF-16 is quite straightforward: the code point U+000E is encoded as 000E, U+000F as 000F, and so on.
  • The issue is that 16 bits cover a range that is not sufficient to accommodate all Unicode code points (0000..FFFF allows only 65,536 possible values). We might simply use two 16-bit words (4 bytes) for code points beyond this range (in fact, my misunderstanding was exactly about why UTF-16 doesn't do that). However, this naive approach makes some sequences impossible to decode unambiguously. For example, if we encode the code point U+10000 as 0001 0000 (hex notation), how should a decoder interpret that representation: as the two consecutive code points U+0001 and U+0000, or as the single code point U+10000?
  • The Unicode specification takes a better way. To encode the range U+10000..U+10FFFF (1,048,576 code points), we set apart 1,024 + 1,024 = 2,048 values from those that can be encoded with 16 bits (the spec chose the range D800..DFFF for this). When a decoder encounters a value from D800..DBFF (the High Surrogate Area) in memory, it knows this is not a "fully-fledged" code point (not a scalar value in terms of the spec); it must read another 16 bits, which must hold a value from DC00..DFFF (the Low Surrogate Area), and from the two values together work out which code point in U+10000..U+10FFFF was encoded with these 4 bytes (with this surrogate pair). Note that this scheme makes it possible to encode 1,024 * 1,024 = 1,048,576 code points, exactly the number we need.
  • Because the Unicode codespace is defined as the range of integers from 0 to 10FFFF, we have to introduce the concept of surrogate code points (not code units), the range U+D800..U+DFFF (we can't exclude this range from the Unicode codespace). Since surrogate code points are designated solely for the surrogate code units of UTF-16 (see C1, D74) and can never be assigned characters, they can be seen as a UTF-16 relic.
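To make the bullet points above concrete, here is a small Python sketch that decodes a surrogate pair back into a code point and shows that a lone surrogate code point is rejected by a conformant encoder (`from_surrogate_pair` is a name I made up for illustration):

```python
def from_surrogate_pair(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair back into a single code point."""
    assert 0xD800 <= high <= 0xDBFF, "first unit must be a high surrogate"
    assert 0xDC00 <= low <= 0xDFFF, "second unit must be a low surrogate"
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print(hex(from_surrogate_pair(0xD83D, 0xDE00)))  # 0x1f600

# Surrogate code points are not scalar values, so a well-formed
# encoder refuses to encode a lone surrogate:
try:
    "\ud800".encode("utf-16-be")
except UnicodeEncodeError:
    print("lone surrogate rejected")
```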