
There are several character encodings, and UTF-8 seems to be in the lead; it is said to be the most efficient one so far. So I was wondering: why don't we just encode Unicode characters with their code points directly?

For example:

Character:    'a',    'â',    'Ј',    ...
Code point:   U+0061, U+00E2, U+0408, ...
Encoded byte: 61,     e2,     408,    ...

and so on. Wouldn't that be the most efficient and easiest way to encode characters?

0x408 does not fit in a byte. – user4003407
As @PetSerAl points out, this is not an encoding. When you need to represent U+0408, it will take more than one byte. UTF-8, UTF-16 and UTF-32 are different ways of encoding that information using more than one byte. They have different tradeoffs. – janm
UTF-8 is common for files and streams. UTF-16 (or its precursor UCS-2) has been used for in-memory text processing since VB4, Java, .NET, JavaScript, the Win32 API, etc. – Tom Blodget
Efficiency is relative. Where transmission efficiency is a concern, HTTP compression (the Content-Encoding: gzip header) is often used, as it was for this page. – Tom Blodget

1 Answer


A single 8-bit byte can hold at most 256 values (0-255), so it cannot hold the majority of the more than 1.1 million Unicode codepoints as-is.
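
For instance, here is a minimal Python sketch (Python is just a convenient choice for illustration, not something the question assumes) showing that most codepoints simply do not fit in one byte:

# A codepoint is just a number; check whether it fits in a single byte.
for ch in "aâЈ€😁":
    cp = ord(ch)
    print(f"U+{cp:04X} fits in a single byte: {cp <= 0xFF}")

Only 'a' and 'â' pass the check; 'Ј' (U+0408) and everything above it need more than one byte, which is exactly why an encoding scheme is needed.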

UTFs (Unicode Transformation Formats) are standardized encodings designed to represent Unicode codepoints as sequences of codeunits, which can then be expressed in byte form. The number in a UTF's name is the number of bits used to encode each codeunit:

  • UTF-8 uses 8-bit codeunits
  • UTF-16 uses 16-bit codeunits
  • UTF-32 uses 32-bit codeunits
  • and so on (there are other UTFs available, but these three are the main ones used; a quick demonstration follows this list).
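
For example, the byte length of any encoded string is always a whole multiple of the codeunit size (a minimal Python sketch; the utf-16-le/utf-32-le codec names fix the byte order explicitly so that no BOM is prepended):

# Encode one character and relate its byte length to the codeunit size.
for codec, unit_bytes in [("utf-8", 1), ("utf-16-le", 2), ("utf-32-le", 4)]:
    data = "€".encode(codec)  # U+20AC EURO SIGN
    print(f"{codec:9}: {len(data)} bytes = {len(data) // unit_bytes} codeunit(s)")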

Most UTFs are variable-length (UTF-32 is not), requiring 1 or more codeunits to encode a given codepoint, as the sketch after this list shows:

  • In UTF-8, codepoints in the ASCII range (U+0000 - U+007F) use 1 codeunit; higher codepoints use 2-4 codeunits, depending on their value.

  • In UTF-16, codepoints in the BMP (U+0000 - U+FFFF) use 1 codeunit, higher codepoints use 2 codeunits (known as a "surrogate pair").

  • In UTF-32, all codepoints use 1 32-bit codeunit.
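
The variable lengths are easy to observe directly (a minimal Python sketch; the codeunit counts are derived by dividing each byte length by the codeunit size):

# Count the codeunits needed per codepoint in each UTF.
for ch in "a\u00e2\u20ac\U0001f601":
    u8  = len(ch.encode("utf-8"))           # 1 byte per codeunit
    u16 = len(ch.encode("utf-16-le")) // 2  # 2 bytes per codeunit
    u32 = len(ch.encode("utf-32-le")) // 4  # 4 bytes per codeunit
    print(f"U+{ord(ch):04X}: UTF-8={u8}, UTF-16={u16}, UTF-32={u32}")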

So, for example, the codepoints you mentioned would be encoded as follows:

U+0061 LATIN SMALL LETTER A

UTF    | Codeunits | Bytes
-----------------------------------------
UTF-8  | x61       | x61
-----------------------------------------
UTF-16 | x0061     | x61 x00         (LE)
       |           | x00 x61         (BE)
-----------------------------------------
UTF-32 | x00000061 | x61 x00 x00 x00 (LE)
       |           | x00 x00 x00 x61 (BE)

U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX

UTF    | Codeunits | Bytes
-----------------------------------------
UTF-8  | xC3 xA2   | xC3 xA2
-----------------------------------------
UTF-16 | x00E2     | xE2 x00         (LE)
       |           | x00 xE2         (BE)
-----------------------------------------
UTF-32 | x000000E2 | xE2 x00 x00 x00 (LE)
       |           | x00 x00 x00 xE2 (BE)

U+0408 CYRILLIC CAPITAL LETTER JE

UTF    | Codeunits | Bytes
-----------------------------------------
UTF-8  | xD0 x88   | xD0 x88
-----------------------------------------
UTF-16 | x0408     | x08 x04         (LE)
       |           | x04 x08         (BE)
-----------------------------------------
UTF-32 | x00000408 | x08 x04 x00 x00 (LE)
       |           | x00 x00 x04 x08 (BE)

And just for good measure, here are a couple of other examples:

U+20AC EURO SIGN

UTF    | Codeunits   | Bytes
-------------------------------------------
UTF-8  | xE2 x82 xAC | xE2 x82 xAC
-------------------------------------------
UTF-16 | x20AC       | xAC x20         (LE)
       |             | x20 xAC         (BE)
-------------------------------------------
UTF-32 | x000020AC   | xAC x20 x00 x00 (LE)
       |             | x00 x00 x20 xAC (BE)

U+1F601 GRINNING FACE WITH SMILING EYES

UTF    | Codeunits       | Bytes
-----------------------------------------------
UTF-8  | xF0 x9F x98 x81 | xF0 x9F x98 x81
-----------------------------------------------
UTF-16 | xD83D xDE01     | x3D xD8 x01 xDE (LE)
       |                 | xD8 x3D xDE x01 (BE)
-----------------------------------------------
UTF-32 | x0001F601       | x01 xF6 x01 x00 (LE)
       |                 | x00 x01 xF6 x01 (BE)
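
All of the tables above can be reproduced with a short Python sketch (bytes.hex() with a separator argument requires Python 3.8+):

# Print the encoded bytes of each codepoint in all three UTFs, both byte orders.
for cp in (0x0061, 0x00E2, 0x0408, 0x20AC, 0x1F601):
    print(f"U+{cp:04X}")
    for codec in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
        print(f"  {codec:9}: {chr(cp).encode(codec).hex(' ')}")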

As you can see, UTF-8 is not always the most efficient in terms of byte size. It is good for Latin-based text, but less so for Asian scripts, symbols, emoji, etc. On the other hand, it does not suffer from the endian issues that UTF-16 and UTF-32 do, which makes it well suited to data storage and communications. For most common uses of Unicode, UTF-8 is decent enough, though UTF-16 is better in some cases. When processing Unicode data in memory, UTF-16 is easier to work with than UTF-8 (and UTF-32 is easiest), as there is less variation to deal with.
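
To make that size comparison concrete, here is a minimal Python sketch comparing the encoded sizes of short samples from different scripts (the sample strings are arbitrary choices, not taken from the question):

# Compare total encoded size per script sample and per UTF.
samples = {"Latin": "hello", "Japanese": "こんにちは", "Emoji": "😀😁😂"}
for name, text in samples.items():
    sizes = {c: len(text.encode(c)) for c in ("utf-8", "utf-16-le", "utf-32-le")}
    print(f"{name:9}: {sizes}")

UTF-8 wins for the Latin sample (5 vs 10 vs 20 bytes), loses to UTF-16 for the Japanese one (15 vs 10 vs 20), and ties with UTF-16 for the emoji (12 vs 12 vs 12).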