Byte Order Mark: confusing the UTF encoding

Question

The byte order mark (BOM) uses the Unicode character U+FEFF to determine the encoding of a text file according to the following rule:

+-------------+-----------------------+
|    Bytes    |     Encoding Form     |
+-------------+-----------------------+
| 00 00 FE FF | UTF-32, big-endian    |
| FF FE 00 00 | UTF-32, little-endian |
| FE FF       | UTF-16, big-endian    |
| FF FE       | UTF-16, little-endian |
| EF BB BF    | UTF-8                 |
+-------------+-----------------------+

My question is: is there any combination of bytes that can cause one UTF encoding to be confused with another?

For example, if I have a UTF-16 big-endian encoded file without a BOM that contains the characters U+EFBB and U+BF40 (bytes EF BB BF 40), can it be confused with a UTF-8 encoded file with a BOM followed by the ASCII character @?

Tom Blodget · Accepted Answer · 2018-04-09T03:36:21

Sure. Without knowing the encoding, a sequence of U+0000 characters has an unknown length. The same four zero bytes are a different number of characters in each encoding:

00 00 00 00  UTF-8   U+0000 U+0000 U+0000 U+0000     
00 00 00 00  UTF-16  U+0000 U+0000 
00 00 00 00  UTF-32  U+0000
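
A minimal Python sketch of the same point, assuming the big-endian codec names from the standard library (with all-zero bytes the byte order makes no difference). The helper name `count_characters` is only illustrative:

```python
def count_characters(data: bytes, encoding: str) -> int:
    """Decode data with the given encoding and return the number of code points."""
    return len(data.decode(encoding))


zeros = bytes(4)  # 00 00 00 00

for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
    # Same bytes, different character count depending on the assumed encoding.
    print(encoding, count_characters(zeros, encoding))
```

All three decodes succeed without error, so nothing in the bytes themselves says which reading is the intended one.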

By the way, bytes that look like a byte order mark cannot be used to determine the encoding of a text file. In general, it's an unsolvable problem: if the encoding is not communicated along with the bytes, that information is lost.
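
As a rough illustration, the bytes from the question, and the UTF-32 little-endian BOM itself, each decode cleanly in two different ways. This is only a sketch using Python's standard codecs; the variable names are made up for the example:

```python
# The bytes from the question: EF BB BF 40
question_bytes = bytes.fromhex("EF BB BF 40")

# Read as UTF-8 with a BOM: the BOM is stripped and "@" remains.
as_utf8 = question_bytes.decode("utf-8-sig")

# Read as UTF-16 big-endian without a BOM: U+EFBB followed by U+BF40.
as_utf16_be = question_bytes.decode("utf-16-be")

# FF FE 00 00 is the UTF-32 little-endian BOM, but it is also the
# UTF-16 little-endian BOM followed by a single U+0000.
bom_bytes = bytes.fromhex("FF FE 00 00")
as_utf32 = bom_bytes.decode("utf-32")  # BOM consumed, empty string
as_utf16 = bom_bytes.decode("utf-16")  # BOM consumed, one NUL character

print(repr(as_utf8), [hex(ord(c)) for c in as_utf16_be])
print(repr(as_utf32), repr(as_utf16))
```

Both interpretations of each byte sequence are valid text, so a BOM-shaped prefix is at best a hint, not proof of the encoding.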
