I was reading about Unicode at http://www.joelonsoftware.com/articles/Unicode.html. Joel says UCS-2 encodes all Unicode characters in 2 bytes, whereas UTF-8 may take up to 6 bytes to encode some Unicode characters. Could you please explain, with an example, how a 6-byte UTF-8 encoded Unicode character is encoded in UCS-2?
2 Answers
UCS-2 was created when Unicode had fewer than 65,536 codepoints, so every codepoint fit in at most 2 bytes. Once Unicode grew beyond 65,536 codepoints, UCS-2 became obsolete and was replaced by UTF-16, which encodes all of the UCS-2-compatible codepoints (the Basic Multilingual Plane, U+0000 to U+FFFF) in 2 bytes and the rest in 4 bytes via surrogate pairs.
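For instance, here is a small Python 3 sketch (the two characters are just arbitrary examples) showing that a BMP codepoint takes 2 bytes in UTF-16 while a codepoint above U+FFFF takes 4 bytes as a surrogate pair:

    bmp_char = "\u20AC"         # U+20AC EURO SIGN: fits in UCS-2 / one UTF-16 unit
    astral_char = "\U0001D11E"  # U+1D11E MUSICAL SYMBOL G CLEF: needs a surrogate pair

    print(bmp_char.encode("utf-16-be").hex())     # 20ac      -> 2 bytes
    print(astral_char.encode("utf-16-be").hex())  # d834dd1e  -> 4 bytes (pair D834 + DD1E)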
UTF-8 was originally designed to use sequences of up to 6 bytes (covering codepoints up to U+7FFFFFFF) but was later restricted to 4 bytes (up to U+1FFFFF, though anything above U+10FFFF is forbidden) so that it round-trips cleanly with UTF-16 and does not encode any codepoint that UTF-16 cannot represent. The maximum codepoint that both UTF-8 and UTF-16 support is U+10FFFF.
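To illustrate the length progression, here is a quick Python 3 sketch (the codepoints are just convenient boundary examples):

    for cp in (0x41, 0x7FF, 0xFFFF, 0x10FFFF):
        encoded = chr(cp).encode("utf-8")
        print(f"U+{cp:06X} -> {len(encoded)} byte(s): {encoded.hex()}")
    # U+000041 -> 1 byte(s): 41
    # U+0007FF -> 2 byte(s): dfbf
    # U+00FFFF -> 3 byte(s): efbfbf
    # U+10FFFF -> 4 byte(s): f48fbfbf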
So, to answer your question: a codepoint that requires a 5- or 6-byte UTF-8 sequence (U+200000 to U+7FFFFFFF) cannot be encoded in UCS-2, or even in UTF-16. There are simply not enough bits available to hold such large codepoint values.
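As a rough back-of-the-envelope check (Python 3): surrogate pairs carry only 10 + 10 = 20 bits of payload, so UTF-16 tops out at U+10FFFF, well below U+200000 where the 5-byte UTF-8 sequences would begin:

    surrogate_payload_bits = 10 + 10                    # high + low surrogate each carry 10 bits
    utf16_max = 0xFFFF + (1 << surrogate_payload_bits)  # BMP plus everything a pair can reach
    print(hex(utf16_max))            # 0x10ffff
    print(0x200000 > utf16_max)      # True: the first "5-byte" codepoint is already out of range
    # chr(0x110000) raises ValueError in Python for the same reason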
UCS-2 stores everything it can in two bytes and does nothing about the code points that won't fit into that space, which is why UCS-2 is pretty much useless today.
Instead, we have UTF-16, which looks identical to UCS-2 for all the two-byte sequences, but also allows surrogate pairs: pairs of two-byte units. Using those, the remaining code points can be encoded in a total of 4 bytes each.
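If it helps, here is a rough Python 3 sketch of how such a surrogate pair is built by hand; U+1F600 is just an arbitrary example of a code point above U+FFFF:

    cp = 0x1F600                     # an emoji, well above U+FFFF
    offset = cp - 0x10000            # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits -> lead (high) surrogate
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits -> trail (low) surrogate
    print(hex(high), hex(low))                 # 0xd83d 0xde00
    print(chr(cp).encode("utf-16-be").hex())   # d83dde00 -- the same pair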