A continuation byte in UTF-8 is any byte where the top two bits are 10. Continuation bytes are the second and subsequent bytes of a multi-byte sequence. The following table may help:
    Unicode code points  Encoding  Binary value
    -------------------  --------  ------------
    U+000000-U+00007f    0xxxxxxx  0xxxxxxx

    U+000080-U+0007ff    110yyyxx  00000yyy xxxxxxxx
                         10xxxxxx

    U+000800-U+00ffff    1110yyyy  yyyyyyyy xxxxxxxx
                         10yyyyxx
                         10xxxxxx

    U+010000-U+10ffff    11110zzz  000zzzzz yyyyyyyy xxxxxxxx
                         10zzyyyy
                         10yyyyxx
                         10xxxxxx
Here you can see how Unicode code points map to UTF-8 byte sequences, and how the bits of each code point's binary value are distributed across those bytes.
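To make the table concrete, here is a minimal sketch of an encoder that follows it row by row. The function name utf8_encode and its signature are made up for illustration, and the sketch assumes the caller wants only the bit layout shown above: it rejects values above U+10ffff but does not reject surrogates (U+d800-U+dfff), which a strict encoder would.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: encodes code point cp into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 if cp is above U+10ffff.
 */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp <= 0x7F) {                 /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                /* 110yyyxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {               /* 1110yyyy 10yyyyxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {             /* 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                         /* outside the Unicode range */
}
```

For example, U+20ac (the euro sign) falls in the third row and encodes to the three bytes e2 82 ac.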
The basic rules are these:
- If a byte starts with a 0 bit, it's a single-byte value less than 128.
- If it starts with 11, it's the first byte of a multi-byte sequence, and the number of 1 bits at the start indicates how many bytes there are in total (110xxxxx has two bytes, 1110xxxx has three and 11110xxx has four).
- If it starts with 10, it's a continuation byte.
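These three rules translate directly into bit masks. The following is a rough sketch in C; the names utf8_is_continuation and utf8_sequence_length are invented for this example. It looks only at the leading bits, so it is not a validator: bytes such as c0, c1 or f5 pass the bit-pattern test even though they never occur in well-formed UTF-8.

```c
/* Hypothetical helper: returns 1 if b is a continuation byte (10xxxxxx). */
int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/*
 * Hypothetical helper: returns the total sequence length announced by a
 * first byte: 1 for 0xxxxxxx, 2 for 110xxxxx, 3 for 1110xxxx and
 * 4 for 11110xxx. Returns 0 for a continuation byte or for a byte with
 * five or more leading 1 bits, neither of which can start a sequence.
 */
int utf8_sequence_length(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 0;
}
```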
This distinction allows some quite handy processing, such as backing up from any byte in a sequence to find the first byte of that code point: just search backwards until you find a byte that does not begin with the bits 10.
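As an illustration of the backward search, here is a small sketch. The function name utf8_code_point_start is hypothetical, and it assumes buf holds well-formed UTF-8 and that pos is a valid index into it; the pos > 0 check only stops the loop from running off the front of the buffer.

```c
#include <stddef.h>

/*
 * Hypothetical helper: given an index anywhere inside a code point,
 * steps backwards over continuation bytes (10xxxxxx) and returns the
 * index of the first byte of that code point.
 */
size_t utf8_code_point_start(const unsigned char *buf, size_t pos)
{
    while (pos > 0 && (buf[pos] & 0xC0) == 0x80) {
        pos--;
    }
    return pos;
}
```

Since a well-formed sequence has at most three continuation bytes, the loop never takes more than three steps on valid input.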
Similarly, it can also be used for a UTF-8 strlen by only counting non-10xxxxxx bytes.
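A minimal sketch of such a function might look like this. The name utf8_strlen is made up here (it is not a standard library function), and the code assumes a NUL-terminated, well-formed UTF-8 string. It counts code points, which is not always the same as the number of characters a user sees, since combining marks are separate code points.

```c
#include <stddef.h>

/*
 * Hypothetical helper: counts code points in a NUL-terminated UTF-8
 * string by counting every byte that is not a continuation byte.
 */
size_t utf8_strlen(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

The cast to unsigned char matters on platforms where plain char is signed, because bytes of 0x80 and above would otherwise be sign-extended to negative values before the mask is applied.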