26
votes

I'm trying to figure out what "continuation bytes" are (for curiousity sake) in the UTF-8 encoding.

Wikipedia introduces this term in the UTF-8 article without defining it at all

Google search returns no useful information either. I'm about to jump into the official specification, but would preferably read a high-level summary first.

3
Looks like somebody just edited the Wikipedia article. (:tripleee

3 Answers

47
votes

A continuation byte in UTF-8 is any byte where the top two bits are 10.

They are the subsequent bytes in multi-byte sequences. The following table may help:

Unicode code points  Encoding  Binary value
-------------------  --------  ------------
 U+000000-U+00007f   0xxxxxxx  0xxxxxxx

 U+000080-U+0007ff   110yyyxx  00000yyy xxxxxxxx
                     10xxxxxx

 U+000800-U+00ffff   1110yyyy  yyyyyyyy xxxxxxxx
                     10yyyyxx
                     10xxxxxx

 U+010000-U+10ffff   11110zzz  000zzzzz yyyyyyyy xxxxxxxx
                     10zzyyyy
                     10yyyyxx
                     10xxxxxx

Here you can see how the Unicode code points map to UTF-8 multi-byte byte sequences, and their equivalent binary values.

The basic rules are this:

  1. If a byte starts with a 0 bit, it's a single byte value less than 128.
  2. If it starts with 11, it's the first byte of a multi-byte sequence and the number of 1 bits at the start indicates how many bytes there are in total (110xxxxx has two bytes, 1110xxxx has three and 11110xxx has four).
  3. If it starts with 10, it's a continuation byte.

This distinction allows quite handy processing such as being able to back up from any byte in a sequence to find the first byte of that code point. Just search backwards until you find one not beginning with the 10 bits.

Similarly, it can also be used for a UTF-8 strlen by only counting non-10xxxxxx bytes.

1
votes

In short words, continuation bytes are the bytes except first byte or single byte. In UTF-8, continuation bytes are started with 0x10.

-4
votes

“Continuation byte” isn’t a term but a normal English word and the term “byte.” If used as a pseudo-term, it may confuse the reader.

The Unicode Standard uses this expression in one place only, Ch. 5, clause 5.22: “For example, consider the first three bytes of a four-byte UTF-8 sequence, followed by a byte which cannot be a valid continuation byte: .” In this context, the meaning is clear: it’s just a byte that continues something, namely a sequence of bytes.

The Wikipedia page apparently uses “continuation byte” to mean any byte in the UTF-8 encoding except the first byte of the encoded form of a character.