1 vote

From Wikipedia:

Unicode comprises 1,114,112 code points in the range 0 (hex) to 10FFFF (hex).

I am a little puzzled that Unicode encoding can take up to 4 bytes. Couldn't 1,114,112 code points comfortably fit in 3 bytes? Maybe I am missing some special situation where 4 bytes are needed; please give a concrete example, if any.

Did you already read the Wikipedia article on the history of the UTF-8 encoding? That should answer a lot of questions. – Roland Illig
I did read it, but there is probably some gap in my understanding of it, or maybe I am overthinking it. I am guessing that encoding code points with 1-4 bytes is more of a rule than what is actually needed to fit the current Unicode code points, which have a limit of 21 bits. I guess they are using 32 bits instead of going to 24 bits to make room for the future. – Saturday Sherpa
Possible duplicate of Why is there no UTF-24? – phuclv
Unicode is not an encoding. It makes no sense to speak of a size for a Unicode code point. Unicode is a mapping between code points and semantic names (e.g. 'LATIN CAPITAL LETTER A'). You are free to choose your own encoding. – Giacomo Catenazzi
A Unicode encoding of your own invention could, because code points fit within 21 bits. You could even invent one with one to three 8-bit code units. UTF-8 cannot; it has already been defined not to. So, what's the question? – Tom Blodget

3 Answers

4 votes

The Wikipedia article on the history of UTF-8 says that an earlier version of UTF-8 allowed more than 21 bits to be encoded. These encodings took 5 or even 6 bytes.

After it became clear that 2^21 code points would probably be enough for the remaining time of humankind (the same thinking as with 5 bits, 6 bits, 7 bits, 8 bits and 16 bits), the 5-byte and 6-byte encodings were simply forbidden. All other encoding rules were kept, for backwards compatibility.

As a consequence, the number space for the Unicode code points is now 0..10FFFF, which needs even a bit less than 21 bits. Therefore it is worth checking whether these 21 bits could fit into the 24 bits of 3 bytes, instead of the current 4 bytes.

One important property of UTF-8 is that each byte that is part of a multibyte encoding has its highest bit set. To distinguish the leading byte from the trailing bytes, the leading byte has the second-highest bit set, while the trailing bytes have the second-highest bit cleared. This property ensures a consistent ordering. Therefore the characters could be encoded like this:

0xxx_xxxx                        7 bits freely chooseable
110x_xxxx 10xx_xxxx             11 bits freely chooseable
1110_xxxx 10xx_xxxx 10xx_xxxx   16 bits freely chooseable

Now the 1-, 2- and 3-byte patterns together cover 2^7 + 2^11 + 2^16 = 67,712 code points, roughly 16.05 bits' worth, which is far fewer than the 1,114,112 needed. Therefore, encoding all Unicode code points with at most 3 bytes under the current UTF-8 encoding rules is impossible.
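
Just to make the arithmetic concrete, here is a small Python sanity check of that count (nothing more than the sums above spelled out):

# Count how many code points the 1-, 2- and 3-byte UTF-8 patterns can carry.
payload_bits = [7, 11, 16]                     # freely choosable bits per length
capacity = sum(2 ** bits for bits in payload_bits)
needed = 0x110000                              # 1,114,112 code points: U+0000..U+10FFFF

print(f"1..3-byte UTF-8 capacity: {capacity:,}")   # 67,712
print(f"code points to cover:     {needed:,}")     # 1,114,112
print(f"enough? {capacity >= needed}")             # False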

You can define another encoding where the highest bit of each byte is the continuation bit:

0xxx_xxxx                        7 bits freely chooseable
1xxx_xxxx 0xxx_xxxx             14 bits freely chooseable
1xxx_xxxx 1xxx_xxxx 0xxx_xxxx   21 bits freely chooseable

Now you have enough space to encode all 21-bit code points. But that is an entirely new encoding, so you would have to establish it world-wide. Given the experience with Unicode, that will take about 20 years. Good luck.
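
Just to illustrate, here is a minimal Python sketch of such an encoder and decoder (encode_cp and decode are made-up names, and the scheme is exactly the three patterns above, not any standard encoding):

def encode_cp(cp):
    """Encode one code point: 7 payload bits per byte, big-endian,
    high bit set on every byte except the last one."""
    assert 0 <= cp <= 0x10FFFF
    if cp < 1 << 7:
        return bytes([cp])
    if cp < 1 << 14:
        return bytes([0x80 | (cp >> 7), cp & 0x7F])
    return bytes([0x80 | (cp >> 14), 0x80 | ((cp >> 7) & 0x7F), cp & 0x7F])

def decode(data):
    """Decode a whole byte string back into a list of code points."""
    cps, cp = [], 0
    for b in data:
        cp = (cp << 7) | (b & 0x7F)
        if not b & 0x80:          # high bit clear: last byte of this code point
            cps.append(cp)
            cp = 0
    return cps

# Round-trip check on the boundary values.
for cp in (0x00, 0x7F, 0x80, 0x3FFF, 0x4000, 0x10FFFF):
    assert decode(encode_cp(cp)) == [cp]
print("maximum length:", len(encode_cp(0x10FFFF)), "bytes")   # 3 bytes

Note that, unlike UTF-8, this scheme is not self-synchronizing, and its final bytes collide with plain ASCII values; that is part of the price UTF-8 refuses to pay.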

2 votes

"unicode" is not an encoding. The common encodings for Unicode are UTF-8, UTF-16 and UTF-32. UTF-8 uses 1-, 2-, 3- or 4-byte sequences and is explained below. It is the overhead of the leading/trailing bit sequences that requires 4 bytes for a 21-bit value.

The UTF-8 encoding uses up to 4 bytes to represent Unicode code points using the following bit patterns:

1-byte UTF-8 = 0xxxxxxx = 7 bits = U+0000 to U+007F
2-byte UTF-8 = 110xxxxx 10xxxxxx = 11 bits = U+0080 to U+07FF
3-byte UTF-8 = 1110xxxx 10xxxxxx 10xxxxxx = 16 bits = U+0800 to U+FFFF
4-byte UTF-8 = 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx = 21 bits = U+10000 to U+10FFFF

The advantage of UTF-8 is that the lead bytes and the trailing bytes use distinct, unique bit patterns, which allows easy validation of a correct UTF-8 sequence.
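
A quick illustration with Python's built-in UTF-8 encoder, showing the sequence length at each range boundary:

# Print the UTF-8 sequence length and bytes at each range boundary.
for cp in (0x41, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    encoded = chr(cp).encode("utf-8")
    print(f"U+{cp:06X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
# U+10FFFF -> 4 byte(s): f4 8f bf bf  -- a 4th byte is unavoidable above U+FFFF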

Note also that it is illegal to use a longer sequence (an "overlong encoding") for a Unicode value that fits into a shorter one. For example:

1100_0001 1000_0001 (binary), i.e. C1 81 (hex), would encode U+0041, but 0100_0001 (41 hex) is the required shorter sequence.
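
Python's strict UTF-8 decoder, for instance, rejects that overlong sequence outright:

# C1 81 is an overlong encoding of U+0041 and must be rejected.
overlong = bytes([0b1100_0001, 0b1000_0001])   # C1 81
try:
    overlong.decode("utf-8")
except UnicodeDecodeError as err:
    print("rejected:", err)

print(b"\x41".decode("utf-8"))                 # 'A' -- the only valid encoding of U+0041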

Ref: https://en.wikipedia.org/wiki/UTF-8

0 votes

Let me expand on my comment.

Unicode is not an encoding. It makes no sense to speak of a size for a Unicode code point. Unicode is a mapping between code points and semantic names (e.g. 'LATIN CAPITAL LETTER A'). You are free to choose your own encoding.

Originally, Unicode was meant to be a universal coding that fit into 16 bits (hence the unification of Japanese and Chinese characters). As you can see, it failed to meet that target. A second (very important) goal was to be able to convert to Unicode and back without loss of data (this simplifies the conversion to Unicode: one tool at a time, at any layer).

So there was a problem: how to expand Unicode to support more than 16 bits while, at the same time, not breaking all existing Unicode programs. The idea was to use surrogates, so that programs which only know about 16-bit Unicode (UCS-2) can still work (BTW, Python 2 and JavaScript know only UCS-2, and they still work fine; the language does not need to know that Unicode code points can have more than 16 bits).

The surrogates are what fixed the upper limit of actual Unicode (which is why it is not a power of 2).
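
A small Python illustration of how the surrogate ranges fix that limit (just the standard UTF-16 arithmetic; the variable names are made up):

# 1024 high surrogates x 1024 low surrogates, one pair per code point above U+FFFF.
high_surrogates = 0xDBFF - 0xD800 + 1                 # 1024
low_surrogates  = 0xDFFF - 0xDC00 + 1                 # 1024
supplementary   = high_surrogates * low_surrogates    # 1,048,576

print(hex(0xFFFF + supplementary))                    # 0x10ffff -- the non-power-of-two cap

# Example: U+1F600 becomes a surrogate pair in UTF-16.
print("\N{GRINNING FACE}".encode("utf-16-be").hex(" "))   # d8 3d de 00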

Later, UTF-8 was designed. Its characteristics (by design) are: being compatible with ASCII (for 7-bit characters), encoding all code points (including those above 16 bits), and being able to jump to a random position and quickly synchronize on where a character starts. This last point costs some coding space, so the text is not as dense as it could be, but it is much more practical (and quick for "scrolling" through files). This extra synchronization data is what makes it impossible to encode all the new Unicode code points into 3 bytes with UTF-8.
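
A tiny sketch of that synchronization property in Python (char_start is a made-up helper, not a library function):

def char_start(buf, pos):
    """Return the index of the first byte of the UTF-8 character that
    covers buf[pos], by skipping continuation bytes (10xx_xxxx)."""
    while pos > 0 and buf[pos] & 0xC0 == 0x80:
        pos -= 1
    return pos

text = "naïve 😀".encode("utf-8")
# Land in the middle of the 4-byte emoji and resynchronize to its lead byte.
print(char_start(text, len(text) - 1))   # 7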

You could use a UTF-24 (see the comment above), but then you lose the UTF-8 advantage of being compatible with ASCII; it is also larger than UTF-16, which often needs just 2 bytes per character (not 4).
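
A rough size comparison for text that stays in the 16-bit range, assuming a hypothetical fixed-width 3-byte "UTF-24":

# UTF-8 vs UTF-16 vs a hypothetical 3-bytes-per-character "UTF-24".
text = "Hello, κόσμε"                     # ASCII plus a few 16-bit (Greek) characters
print(len(text.encode("utf-8")))          # 17 bytes
print(len(text.encode("utf-16-be")))      # 24 bytes
print(3 * len(text))                      # 36 bytes for the hypothetical UTF-24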

Remember: the Unicode code points above 16 bits are the rarer ones: ancient scripts, better (semantic) representations of existing glyphs, and new emojis (and hopefully nobody fills an entire long text with nothing but emojis). So the compactness of 3 bytes is not (yet) necessary. Maybe if aliens come to Earth and we have to write in their language's characters, we will use mostly Unicode code points above 16 bits. Not something I think will happen soon.