UTF-8 Continuation bytes

Question

I'm trying to figure out what "continuation bytes" are (for curiosity's sake) in the UTF-8 encoding.

Wikipedia introduces this term in the UTF-8 article without defining it at all.

A Google search returns no useful information either. I'm about to jump into the official specification, but would prefer to read a high-level summary first.

paxdiablo · Accepted Answer · 2012-02-20T04:30:20

A continuation byte in UTF-8 is any byte where the top two bits are 10.

They are the subsequent bytes in multi-byte sequences. The following table may help:

Unicode code points  Encoding  Binary value
-------------------  --------  ------------
 U+000000-U+00007f   0xxxxxxx  0xxxxxxx

 U+000080-U+0007ff   110yyyxx  00000yyy xxxxxxxx
                     10xxxxxx

 U+000800-U+00ffff   1110yyyy  yyyyyyyy xxxxxxxx
                     10yyyyxx
                     10xxxxxx

 U+010000-U+10ffff   11110zzz  000zzzzz yyyyyyyy xxxxxxxx
                     10zzyyyy
                     10yyyyxx
                     10xxxxxx

Here you can see how the Unicode code points map to UTF-8 byte sequences, and how the bits of each sequence line up with the code point's binary value.

The basic rules are these:

If a byte starts with a 0 bit, it's a single byte value less than 128.
If it starts with 11, it's the first byte of a multi-byte sequence and the number of 1 bits at the start indicates how many bytes there are in total (110xxxxx has two bytes, 1110xxxx has three and 11110xxx has four).
If it starts with 10, it's a continuation byte.
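
As a rough illustration of the three rules above, here is a minimal Python sketch that classifies a single byte by its leading bits. The function name `classify_byte` and the label strings are made up for this example. It only inspects bit patterns, so it does not reject bytes that a strict UTF-8 validator would refuse (overlong encodings, for instance).

```python
def classify_byte(b: int) -> str:
    """Classify one byte (0-255) by its leading bits (hypothetical helper)."""
    if b & 0x80 == 0x00:   # 0xxxxxxx: single-byte value below 128
        return "single"
    if b & 0xC0 == 0x80:   # 10xxxxxx: continuation byte
        return "continuation"
    if b & 0xE0 == 0xC0:   # 110xxxxx: first byte of a two-byte sequence
        return "lead2"
    if b & 0xF0 == 0xE0:   # 1110xxxx: first byte of a three-byte sequence
        return "lead3"
    if b & 0xF8 == 0xF0:   # 11110xxx: first byte of a four-byte sequence
        return "lead4"
    return "invalid"       # 11111xxx never appears in UTF-8


# The euro sign U+20AC encodes as the three bytes E2 82 AC.
print([classify_byte(b) for b in "€".encode("utf-8")])
```

For the euro sign this should print one lead byte followed by two continuation bytes.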

This distinction allows quite handy processing, such as being able to back up from any byte in a sequence to find the first byte of that code point. Just search backwards until you find a byte that does not begin with the bits 10.
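
The backwards search could look something like the following Python sketch. The name `code_point_start` is invented for illustration, and the code assumes `data` is well-formed UTF-8 and `i` is a valid index into it.

```python
def code_point_start(data: bytes, i: int) -> int:
    """Return the index of the first byte of the code point that contains data[i].

    Hypothetical helper: assumes data is well-formed UTF-8.
    """
    # Step backwards while the current byte looks like 10xxxxxx.
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i


data = "a€b".encode("utf-8")   # bytes: 61 E2 82 AC 62
# Index 3 is the last byte of the euro sign; its sequence starts at index 1.
print(code_point_start(data, 3))
```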

Similarly, it can be used for a UTF-8 strlen that returns the number of code points rather than the number of bytes, by counting only the non-10xxxxxx bytes.
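
A minimal Python sketch of that counting idea, assuming the input is well-formed UTF-8 (the name `utf8_strlen` is just for this example):

```python
def utf8_strlen(data: bytes) -> int:
    """Count code points by skipping continuation bytes (10xxxxxx).

    Hypothetical helper: assumes data is well-formed UTF-8.
    """
    return sum(1 for b in data if (b & 0xC0) != 0x80)


encoded = "héllo €".encode("utf-8")
# The byte length is larger than the code point count, because
# "é" takes two bytes and "€" takes three.
print(len(encoded), utf8_strlen(encoded))
```

Note that this counts code points, which is not always the same as what a user perceives as characters (combining marks, for example, are separate code points).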
