I was reading about Unicode at http://www.joelonsoftware.com/articles/Unicode.html. Joel says UCS-2 encodes all Unicode characters in 2 bytes, whereas UTF-8 may take up to 6 bytes to encode some Unicode characters. Could you please explain, with an example, how a 6-byte UTF-8 encoded Unicode character is encoded in UCS-2?
2 Answers
UCS-2 was created when Unicode had fewer than 65,536 codepoints, so every codepoint fit in at most 2 bytes. Once Unicode grew beyond 65,536 codepoints, UCS-2 became obsolete and was replaced by UTF-16, which encodes all of the UCS-2-compatible codepoints (the Basic Multilingual Plane, U+0000 to U+FFFF) in 2 bytes and the rest in 4 bytes via surrogate pairs.
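For instance, here is a small Python 3 sketch (the two characters are just arbitrary examples) showing that a BMP codepoint takes 2 bytes in UTF-16 while a codepoint above U+FFFF takes 4 bytes as a surrogate pair:

    bmp_char = "\u20AC"         # U+20AC EURO SIGN: fits in UCS-2 / one UTF-16 unit
    astral_char = "\U0001D11E"  # U+1D11E MUSICAL SYMBOL G CLEF: needs a surrogate pair

    print(bmp_char.encode("utf-16-be").hex())     # 20ac      -> 2 bytes
    print(astral_char.encode("utf-16-be").hex())  # d834dd1e  -> 4 bytes (pair D834 + DD1E)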
UTF-8 was originally designed to use sequences of up to 6 bytes (covering codepoints up to U+7FFFFFFF) but was later restricted to 4 bytes (up to U+1FFFFF, though anything above U+10FFFF is forbidden) so that it round-trips cleanly with UTF-16 and does not encode any codepoint that UTF-16 cannot represent. The maximum codepoint that both UTF-8 and UTF-16 support is U+10FFFF.
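To illustrate the length progression, here is a quick Python 3 sketch (the codepoints are just convenient boundary examples):

    for cp in (0x41, 0x7FF, 0xFFFF, 0x10FFFF):
        encoded = chr(cp).encode("utf-8")
        print(f"U+{cp:06X} -> {len(encoded)} byte(s): {encoded.hex()}")
    # U+000041 -> 1 byte(s): 41
    # U+0007FF -> 2 byte(s): dfbf
    # U+00FFFF -> 3 byte(s): efbfbf
    # U+10FFFF -> 4 byte(s): f48fbfbf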
So, to answer your question: a codepoint that requires a 5- or 6-byte UTF-8 sequence (U+200000 to U+7FFFFFFF) cannot be encoded in UCS-2, or even in UTF-16. There are simply not enough bits available to hold such large codepoint values.
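As a rough back-of-the-envelope check (Python 3): surrogate pairs carry only 10 + 10 = 20 bits of payload, so UTF-16 tops out at U+10FFFF, well below U+200000 where the 5-byte UTF-8 sequences would begin:

    surrogate_payload_bits = 10 + 10                    # high + low surrogate each carry 10 bits
    utf16_max = 0xFFFF + (1 << surrogate_payload_bits)  # BMP plus everything a pair can reach
    print(hex(utf16_max))            # 0x10ffff
    print(0x200000 > utf16_max)      # True: the first "5-byte" codepoint is already out of range
    # chr(0x110000) raises ValueError in Python for the same reason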
UCS-2 stores everything it can in two bytes and does nothing about the code points that won't fit into that space, which is why UCS-2 is pretty much useless today.
Instead, we have UTF-16, which looks identical to UCS-2 for all the two-byte sequences, but also allows surrogate pairs: pairs of two-byte units. Using those, the remaining code points can be encoded in a total of 4 bytes each.
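If it helps, here is a rough Python 3 sketch of how such a surrogate pair is built by hand; U+1F600 is just an arbitrary example of a code point above U+FFFF:

    cp = 0x1F600                     # an emoji, well above U+FFFF
    offset = cp - 0x10000            # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits -> lead (high) surrogate
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits -> trail (low) surrogate
    print(hex(high), hex(low))                 # 0xd83d 0xde00
    print(chr(cp).encode("utf-16-be").hex())   # d83dde00 -- the same pair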