
First, I could be wrong in my understanding of how the UTF-8 encoding scheme works, as I just began to use Unicode. One interesting webpage I came across says:

There is, however, an encoding scheme for Unicode characters that is compatible with ASCII. It is called UTF-8. Under UTF-8, the byte encoding of character numbers between 0 and 127 is the binary value of a single byte, just like ASCII. For character numbers between 128 and 65535, however, multiple bytes are used. If the first byte has a value between 128 and 255, it is interpreted to indicate the number of bytes that are to follow. The bytes following encode a single character number. All the bytes following in the character encoding also have values between 128 and 255, so there will never be any confusion between single-byte characters between 0 and 127, and bytes that are part of multi-byte character representations.

For example, the character é which has a character number of 233 in both Latin 1 and Unicode, is represented by a single byte with a value 233 in conventional Latin 1 encoding, but is represented as two bytes with the values 195 and 169 in UTF-8.

In my interpretation and understanding, because the character é has a value greater than 128 in Unicode (233), it is represented by two bytes. The two bytes have values between 128 and 255 in order to differentiate them from ASCII characters, which need only one byte (technically only 7 bits are used). But how can we reach the number 233 using the values 195 and 169 stored in the two bytes? Or, what's the procedure to get 233 from the two bytes? Obviously, if we add the two values (the two bytes), we get 195 + 169 = 364, which is not the same as the code point for the character é, 233. What am I missing here?

*I totally understand some characters will require more bytes to represent, but that's another story.

Look up UTF8 on Wikipedia. The algorithm is described. – Mark Tolonen
You need to apply the concept of place values from middle school math, or, in bit arithmetic, shifting or multiplying. If you read C, there are many implementations of utf8.c. – Tom Blodget

1 Answer


UTF-8 is an encoding scheme. It is not enough to just add the raw bytes together; you have to strip the encoding marker bits first, and then concatenate (not add) the remaining bits together.

UTF-8 is formally defined in RFC 3629, which defines the following table:

Char. number range  |        UTF-8 octet sequence
   (hexadecimal)    |              (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

This is better visualized in the "UTF-8 encoding" byte-layout table on Wikipedia.
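
To make the table concrete, here is a minimal Python sketch (the helper name utf8_length is just for illustration, not a standard function) that maps a codepoint to the row of the table it falls in:

# Rough sketch: how many UTF-8 bytes a codepoint needs, per the RFC 3629 table.
def utf8_length(codepoint):
    if codepoint <= 0x7F:
        return 1            # 0xxxxxxx
    if codepoint <= 0x7FF:
        return 2            # 110xxxxx 10xxxxxx
    if codepoint <= 0xFFFF:
        return 3            # 1110xxxx 10xxxxxx 10xxxxxx
    if codepoint <= 0x10FFFF:
        return 4            # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    raise ValueError("codepoint outside the Unicode range")

print(utf8_length(0xE9))    # 2 -- é needs two bytes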

Now, let's apply that to your example Unicode character, é, which is Unicode codepoint U+00E9 LATIN SMALL LETTER E WITH ACUTE.

If you look at the table, codepoint U+00E9 falls in the 2nd row, so it is encoded using two bytes:

110xxxxx 10xxxxxx

The 1s and 0s are literal; they must appear exactly as shown in the encoded bytes. The xs represent the original bits of the codepoint being encoded.

Setting the highest bit to 1 ensures the bytes are not confused with 7-bit ASCII characters. The number of leading 1s in the first byte of an encoded sequence also specifies how many total bytes are in the complete sequence.
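
As a rough sketch (assuming well-formed input, with no validation of the continuation bytes), you can read the sequence length straight off the lead byte in Python:

def sequence_length(lead_byte):
    if lead_byte < 0x80:            # 0xxxxxxx -> single ASCII byte
        return 1
    if lead_byte >> 5 == 0b110:     # 110xxxxx -> 2-byte sequence
        return 2
    if lead_byte >> 4 == 0b1110:    # 1110xxxx -> 3-byte sequence
        return 3
    if lead_byte >> 3 == 0b11110:   # 11110xxx -> 4-byte sequence
        return 4
    raise ValueError("not a valid UTF-8 lead byte")

print(sequence_length(0xC3))        # 2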

The full Unicode repertoire requires a maximum of 21 bits to represent all possible codepoints, up to and including U+10FFFF, which is the highest codepoint that UTF-16 can physically encode (UTF-8 could physically encode higher codepoints, but is artificially restricted by RFC 3629 to maintain 100% compatibility with UTF-16). Since there is no 21-bit data type in most programming languages, the next highest data type is a 32-bit integer.

Codepoint U+00E9 is 0x000000E9 as a 32-bit number in hex. That is 00000000 00000000 00000000 11101001 in binary bits. The 2nd row of the table only uses 11 bits from the codepoint, so you would strip off the high 21 bits and fill in the xs with the remaining 11 low bits:

   11000000 10000000
OR    00011   101001
--------------------
   11000011 10101001 = 0xC3 0xA9
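
The same OR step can be written with shifts and masks. Here is an illustrative Python snippet for this specific 2-byte case (a general encoder would branch on the table rows shown above):

cp = 0xE9                                # codepoint for é

byte1 = 0b11000000 | (cp >> 6)           # lead byte: 110 + top 5 bits (00011)
byte2 = 0b10000000 | (cp & 0b00111111)   # continuation: 10 + low 6 bits (101001)

print(hex(byte1), hex(byte2))            # 0xc3 0xa9
print('é'.encode('utf-8'))               # b'\xc3\xa9' -- Python's own encoder agrees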

To reverse the process, simply remove the non-x bits from each byte, and join the remaining bits together:

    11000011 10101001
AND 00011111 00111111
---------------------
       00011   101001 = 11101001 = 0xE9
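
And the reverse direction, again as a sketch for the 2-byte case only:

b1, b2 = 0xC3, 0xA9

payload1 = b1 & 0b00011111               # keep the low 5 bits of the lead byte    -> 00011
payload2 = b2 & 0b00111111               # keep the low 6 bits of the continuation -> 101001

codepoint = (payload1 << 6) | payload2   # concatenate the payloads, don't add them
print(hex(codepoint), chr(codepoint))    # 0xe9 é
print(bytes([b1, b2]).decode('utf-8'))   # é -- matches Python's built-in decoder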

If you need help implementing this algorithm in a particular programming language, there are plenty of examples and tutorials floating around that show the algorithm from a coding perspective.