A single 8-bit byte can hold at most 256 values (0-255), so it cannot hold the majority of the more than one million Unicode codepoints as-is.
UTFs (Unicode Transformation Formats) are standardized encodings designed to represent Unicode codepoints as encoded codeunits, which can then be expressed as bytes. The number in a UTF's name is the number of bits used to encode each codeunit:
- UTF-8 uses 8-bit codeunits
- UTF-16 uses 16-bit codeunits
- UTF-32 uses 32-bit codeunits
- and so on (there are other UTFs, but these three are the main ones in use).
Most UTFs are variable-length (UTF-32 is the exception), requiring one or more codeunits to encode a given codepoint, as the sketch after this list demonstrates:
- In UTF-8, codepoints in the ASCII range (U+0000 - U+007F) use 1 codeunit; higher codepoints use 2-4 codeunits, depending on the codepoint value.
- In UTF-16, codepoints in the BMP (U+0000 - U+FFFF) use 1 codeunit; higher codepoints use 2 codeunits (known as a "surrogate pair").
- In UTF-32, all codepoints use a single 32-bit codeunit.
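To make the variable-length behavior concrete, here is a minimal Python sketch (Python is just a convenient choice here; the codeunit counts are the same in any language) that derives the codeunit count from the encoded byte length:

    # Codeunit count = encoded byte length / codeunit size
    # (UTF-8 codeunit = 1 byte, UTF-16 = 2 bytes, UTF-32 = 4 bytes).
    for ch in ("a", "\u00E2", "\u0408", "\u20AC", "\U0001F601"):
        u8  = len(ch.encode("utf-8"))           # 1 byte per codeunit
        u16 = len(ch.encode("utf-16-le")) // 2  # 2 bytes per codeunit
        u32 = len(ch.encode("utf-32-le")) // 4  # 4 bytes per codeunit
        print(f"U+{ord(ch):04X}: UTF-8={u8}, UTF-16={u16}, UTF-32={u32}")

Running it prints 1/1/1 for U+0061 but 4/2/1 for U+1F601, matching the tables below.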
So, for example, the codepoints you mentioned would be encoded as follows:
U+0061 LATIN SMALL LETTER A
UTF | Codeunits | Bytes
-----------------------------------------
UTF-8 | x61 | x61
-----------------------------------------
UTF-16 | x0061 | x61 x00 (LE)
| | x00 x61 (BE)
-----------------------------------------
UTF-32 | x00000061 | x61 x00 x00 x00 (LE)
| | x00 x00 x00 x61 (BE)
U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
UTF | Codeunits | Bytes
-----------------------------------------
UTF-8 | xC3 xA2 | xC3 xA2
-----------------------------------------
UTF-16 | x00E2 | xE2 x00 (LE)
| | x00 xE2 (BE)
-----------------------------------------
UTF-32 | x000000E2 | xE2 x00 x00 x00 (LE)
| | x00 x00 x00 xE2 (BE)
U+0408 CYRILLIC CAPITAL LETTER JE
UTF | Codeunits | Bytes
-----------------------------------------
UTF-8 | xD0 x88 | xD0 x88
-----------------------------------------
UTF-16 | x0408 | x08 x04 (LE)
| | x04 x08 (BE)
-----------------------------------------
UTF-32 | x00000408 | x08 x04 x00 x00 (LE)
| | x00 x00 x04 x08 (BE)
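If you want to double-check the byte columns above, one convenient way (not the only one) is Python's built-in codecs; the -le/-be codec variants emit no BOM, so the output lines up with the tables exactly:

    # Dump U+0408 in each encoding; the other codepoints work the same way.
    ch = "\u0408"  # CYRILLIC CAPITAL LETTER JE
    for codec in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
        data = ch.encode(codec)
        print(f"{codec:9}: " + " ".join(f"x{b:02X}" for b in data))
    # utf-8    : xD0 x88
    # utf-16-le: x08 x04
    # utf-16-be: x04 x08
    # utf-32-le: x08 x04 x00 x00
    # utf-32-be: x00 x00 x04 x08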
And just for good measure, here are a couple of other examples:
U+20AC EURO SIGN
UTF | Codeunits | Bytes
-------------------------------------------
UTF-8 | xE2 x82 xAC | xE2 x82 xAC
-------------------------------------------
UTF-16 | x20AC | xAC x20 (LE)
| | x20 xAC (BE)
-------------------------------------------
UTF-32 | x000020AC | xAC x20 x00 x00 (LE)
| | x00 x00 x20 xAC (BE)
U+1F601 GRINNING FACE WITH SMILING EYES
UTF | Codeunits | Bytes
-----------------------------------------------
UTF-8 | xF0 x9F x98 x81 | xF0 x9F x98 x81
-----------------------------------------------
UTF-16 | xD83D xDE01 | x3D xD8 x01 xDE (LE)
| | xD8 x3D xDE x01 (BE)
-----------------------------------------------
UTF-32 | x0001F601 | x01 xF6 x01 x00 (LE)
| | x00 x01 xF6 x01 (BE)
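The UTF-16 codeunits in that last table come from the surrogate-pair algorithm: subtract 0x10000, split the remaining 20 bits into two 10-bit halves, and add those to 0xD800 (high surrogate) and 0xDC00 (low surrogate). A minimal sketch in Python:

    def surrogate_pair(cp):
        # Only codepoints above the BMP (U+10000..U+10FFFF) need a pair.
        assert 0x10000 <= cp <= 0x10FFFF
        v = cp - 0x10000                      # 20-bit offset above the BMP
        return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

    high, low = surrogate_pair(0x1F601)
    print(f"x{high:04X} x{low:04X}")          # xD83D xDE01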
As you can see, UTF-8 is not always the most efficient in terms of byte size. It is compact for Latin-based text, but less so for Asian scripts, many symbols, and emoji. On the other hand, it does not suffer from the endianness issues that affect UTF-16 and UTF-32, which makes it a good fit for data storage and communications. For most common uses of Unicode, UTF-8 is decent enough, though UTF-16 is better in some cases: when processing Unicode data in memory, UTF-16 is easier to work with than UTF-8, and fixed-width UTF-32 is easiest of all, because there is less length variation to deal with.
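As a rough illustration of that size trade-off, here is a small comparison (the sample strings are arbitrary, picked only to contrast scripts):

    # Encoded size in bytes for a few kinds of text.
    samples = {
        "Latin":    "hello world",
        "Cyrillic": "Здравствуй",
        "CJK":      "你好世界",
        "Emoji":    "😀😁😂",
    }
    for name, s in samples.items():
        print(f"{name:8}: UTF-8={len(s.encode('utf-8')):2}, "
              f"UTF-16={len(s.encode('utf-16-le')):2}, "
              f"UTF-32={len(s.encode('utf-32-le')):2}")
    # Latin   : UTF-8=11, UTF-16=22, UTF-32=44
    # Cyrillic: UTF-8=20, UTF-16=20, UTF-32=40
    # CJK     : UTF-8=12, UTF-16= 8, UTF-32=16
    # Emoji   : UTF-8=12, UTF-16=12, UTF-32=12

UTF-8 wins for Latin text, roughly ties for Cyrillic, and loses to UTF-16 for CJK, which is exactly the pattern described above.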