0
votes

The Byte Order Mark (BOM) uses the Unicode character U+FEFF to determine the encoding of a text file according to the following rules:

+-------------+-----------------------+
|    Bytes    |     Encoding Form     |
+-------------+-----------------------+
| 00 00 FE FF | UTF-32, big-endian    |
| FF FE 00 00 | UTF-32, little-endian |
| FE FF       | UTF-16, big-endian    |
| FF FE       | UTF-16, little-endian |
| EF BB BF    | UTF-8                 |
+-------------+-----------------------+
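
As a rough illustration of the table above, here is a minimal Python sketch of BOM sniffing. The function name `sniff_bom` and the `_BOMS` list are made up for this example; the `codecs.BOM_*` constants are from the standard library. It assumes the longer signatures must be tested first, because `FF FE 00 00` also starts with `FF FE`.

```python
import codecs

# Longest signatures first: FF FE 00 00 (UTF-32 LE) begins with FF FE (UTF-16 LE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length) guessed from a leading BOM, or (None, 0)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

Note that this is only a guess: `FF FE 00 00` could equally be a UTF-16 little-endian BOM followed by U+0000, and the sketch simply prefers the UTF-32 reading.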

My question is: is there any combination of bytes that can cause one UTF encoding to be confused with another?

For example, if I have a UTF-16 big-endian encoded file without a BOM that contains the characters U+EFBB and U+BF40 (EF BB BF 40), can it be confused with a UTF-8 encoded file with a BOM and the ASCII character @?
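
To make the example concrete, the following Python sketch decodes the same four bytes both ways. The variable names are only for illustration; `utf-16-be` and `utf-8-sig` are standard Python codec names, the latter stripping a leading UTF-8 BOM.

```python
data = bytes.fromhex("EF BB BF 40")

# Read as BOM-less UTF-16 big-endian: two code units, U+EFBB and U+BF40.
as_utf16_be = data.decode("utf-16-be")

# Read as UTF-8 with a BOM: the first three bytes are the signature, leaving "@".
as_utf8_bom = data.decode("utf-8-sig")
```

Both calls succeed, so the bytes alone do not say which reading was intended.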

2

2 Answers

1
votes

Sure, without knowing the encoding, a sequence of U+0000 characters has an unknown length.

00 00 00 00  UTF-8   U+0000 U+0000 U+0000 U+0000     
00 00 00 00  UTF-16  U+0000 U+0000 
00 00 00 00  UTF-32  U+0000  
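
The same point as a small Python check (a sketch; the explicit big-endian codec names are chosen here only so that no BOM handling gets in the way):

```python
zeros = bytes(4)  # 00 00 00 00

# Number of U+0000 characters the same four bytes decode to under each encoding.
lengths = {
    enc: len(zeros.decode(enc))
    for enc in ("utf-8", "utf-16-be", "utf-32-be")
}
```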

BTW: bytes that look like a byte order mark cannot be relied on to determine the encoding of a text file. In general, it's an unsolvable problem, because the information about the encoding has been lost.

0
votes

The BOM is designed to reveal the byte order when the code unit size is already known. That is why U+FFFE, the byte-swapped form of U+FEFF, is a noncharacter: it cannot be mistaken for a valid BOM. There is no further restriction on the character set, so some byte sequences can overlap between encodings. (@TomBlodget has an example of a "degenerate" case.)

A BOM in UTF-8 is not really needed, but it should be preserved so that conversion to and from other Unicode encodings round-trips perfectly. It was mainly Windows that started using it to distinguish UTF-8 from other encodings (especially non-Unicode ones), and that is not 100% reliable.

The bytes C0 and C1 are never allowed in UTF-8, and neither are various sequences: the leading bits of the first byte define the length of the sequence, so it must be followed by exactly the corresponding number of bytes carrying the continuation prefix (0b10). So it is usually easy to tell whether a string is UTF-8 (if it is not too short or "degenerate").
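
A minimal sketch of that check in Python, assuming the strict built-in decoder is an acceptable validator (it rejects C0, C1, and broken continuation sequences). The helper name `is_valid_utf8` is invented for this example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8, False otherwise."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

For instance, `is_valid_utf8(b"\xc0\x80")` is false, and so is `is_valid_utf8("héllo".encode("utf-16-le"))`. A "degenerate" input such as `"hi".encode("utf-16-le")` (68 00 69 00) still passes, because every byte is in the ASCII range.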

UTF-32 only has valid values from 0 to 0x10FFFF, so this can be used to distinguish it from UTF-16. Again, "degenerate" and short strings cannot be told apart. On the other hand, we should expect 00 00 very often in UTF-32 and usually no 00 00 in normal UTF-16 text, except possibly at the end.

Control characters and private-use characters should not be used in "public" Unicode text (unless the parties agree on a protocol, but that should not be the case in the question).
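
The range test can be sketched in Python as follows. This is a heuristic under stated assumptions, not a definitive detector: the helper name `looks_like_utf32` and its `byteorder` parameter are made up here, and the sketch additionally rejects surrogate values (D800 to DFFF), which are not valid in UTF-32 either.

```python
def looks_like_utf32(data: bytes, byteorder: str = "little") -> bool:
    """Heuristic: every 4-byte unit must be a valid Unicode scalar value."""
    if len(data) == 0 or len(data) % 4 != 0:
        return False
    for i in range(0, len(data), 4):
        value = int.from_bytes(data[i:i + 4], byteorder)
        # Outside 0..0x10FFFF, or a surrogate: cannot be UTF-32.
        if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
            return False
    return True
```

For example, `"hi".encode("utf-32-le")` passes, while `"hi!!".encode("utf-16-le")` fails because its first four bytes read as 0x00690068. Four zero bytes pass as well, which is the "degenerate" case again.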