7 votes

UTF-16 is a character encoding based on two-byte (16-bit) code units. Storing those two bytes in opposite orders gives UTF-16BE and UTF-16LE.

But I see that the Ubuntu gedit text editor offers an encoding named simply UTF-16, alongside UTF-16BE and UTF-16LE. With a C test program I found that my computer is little endian, and that what it saves as UTF-16 turns out to use the same byte order as UTF-16LE.
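For reference, a minimal sketch of that kind of endianness test in C (one common approach is to look at the first byte of a multi-byte integer; the exact program I used may have differed):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint16_t value = 0x0102;                        /* two distinguishable bytes */
        unsigned char *bytes = (unsigned char *)&value;

        /* On a little endian machine the low-order byte (0x02) is stored first. */
        if (bytes[0] == 0x02)
            printf("little endian\n");
        else
            printf("big endian\n");
        return 0;
    }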

Also: a multi-byte value (such as an integer) can be stored in either of two byte orders on little/big endian computers. A little endian computer stores values in little endian order in hardware (except for values produced by Java, which always uses big endian).

Since text can be saved as either UTF-16LE or UTF-16BE on my little endian computer, are the characters written out one byte at a time (like an ASCII string; see [3]), and is the endianness of UTF-16 simply chosen by whoever writes the file -- not a result of big endian machines writing big endian UTF-16 while little endian machines write little endian UTF-16?

  1. http://www.ibm.com/developerworks/aix/library/au-endianc/
  2. http://teaching.idallen.com/cst8281/10w/notes/110_byte_order_endian.html
  3. ASCII strings and endianness
  4. Is it true that endianness only affects the memory layout of numbers, but not strings? (a post on the relation between the endianness of strings and the machine)
"UTF-16" without qualification is Big Endian by default – but this doesn't mean that all applications behave according to the specification.一二三
@一二三Thank you!I alert with the difference between character and value.In C# program test,a integer saved in little endian machine is little endian.And it can not be correctly read when it is copyed to a big endian machine because byte address reversed. But for multi-bytes character in C#,does the byte address reversion happened too after copy from one machine to the other?hao.zhou
@一二三: That's not quite true. UTF-16 without a BOM is big-endian by default, but it will normally have a BOM which defines endianness.rici

3 Answers

12 votes

"is endian of UTF-16 the computer's endianness?"

The impact of your computer's endianness can be looked at from the point of view of a writer or a reader of a file.

If you are reading a file in a -standard- format, then the kind of machine reading it shouldn't matter. The format should be well-defined enough that no matter what the endianness of the reading machine is, the data can still be read correctly.

That doesn't mean the format can't be flexible. With "UTF-16" (when a "BE" or "LE" disambiguation is not used in the format name) the definition allows files to be marked as either big endian or little endian. This is done with something called the "Byte Order Mark" (BOM) in the first two bytes of the file:

https://en.wikipedia.org/wiki/Byte_order_mark

The existence of the BOM gives options to the writer of a file. They might choose to write out the most natural endianness for a buffer in memory and include a BOM that matches. That wouldn't necessarily be the most efficient format for some other reader, but any program claiming UTF-16 support is supposed to be able to handle it either way.

So yes--the computer's endianness might factor into the endianness choice of a BOM-marked UTF-16 file. Still...a little-endian program is fully able to save a file, label it "UTF-16" and have it be big-endian. As long as the BOM is consistent with the data, it doesn't matter what kind of machine writes or reads it.
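As a concrete illustration (a hedged sketch, not any particular program's code, and the output file name is just an example), a writer on any machine can pick the byte order explicitly: emit the matching BOM, then serialize each code unit byte by byte with shifts, which behaves the same on a little or big endian host:

    #include <stdio.h>
    #include <stdint.h>

    /* Write 16-bit code units as big endian UTF-16 with a BOM,
       regardless of the endianness of the machine running this code. */
    static void write_utf16be_with_bom(FILE *f, const uint16_t *units, size_t count) {
        fputc(0xFE, f);                              /* BOM in big endian order: FE FF */
        fputc(0xFF, f);
        for (size_t i = 0; i < count; i++) {
            fputc((units[i] >> 8) & 0xFF, f);        /* high-order byte first */
            fputc(units[i] & 0xFF, f);               /* low-order byte second */
        }
    }

    int main(void) {
        const uint16_t hello[] = { 'H', 'i', '!' };  /* BMP code units */
        FILE *f = fopen("hello-utf16be.txt", "wb");  /* example output file */
        if (f) {
            write_utf16be_with_bom(f, hello, sizeof hello / sizeof hello[0]);
            fclose(f);
        }
        return 0;
    }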

...what if there's no BOM?

This is where things get a little hazy.

On the one hand, RFC 2781 and the Unicode FAQ are clear. They say that a file in "UTF-16" format that starts with neither 0xFF 0xFE nor 0xFE 0xFF is to be interpreted as big endian:

the unmarked form uses big-endian byte serialization by default, but may include a byte order mark at the beginning to indicate the actual byte serialization used.

Yet to know whether you have a UTF-16LE, UTF-16BE, or BOM-less UTF-16 file...you need metadata outside the file telling you which of the three it is. Because there's not always a place to put that metadata, some programs wound up using heuristics.
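A reader in that position usually starts by peeking at the first two bytes. A rough sketch (the return-value convention here is just for illustration):

    #include <stdio.h>

    /* Classify a file's first two bytes: 1 = little endian BOM,
       2 = big endian BOM, 0 = no BOM (per RFC 2781, assume big endian). */
    static int detect_utf16_bom(FILE *f) {
        int b0 = fgetc(f);
        int b1 = fgetc(f);
        if (b0 == 0xFF && b1 == 0xFE) return 1;   /* FF FE -> UTF-16LE */
        if (b0 == 0xFE && b1 == 0xFF) return 2;   /* FE FF -> UTF-16BE */
        rewind(f);                                /* no BOM: those bytes are data */
        return 0;
    }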

Consider something like this from Raymond Chen (2007):

You might decide that programs that generate UTF-16 files without a BOM are broken, but that doesn't mean that they don't exist. For example,

cmd /u /c dir >results.txt

This generates a UTF-16LE file without a BOM.

That's a valid UTF-16LE file, but where would the "UTF-16LE" meta-label be stored? What are the odds someone passes that off by just calling it a UTF-16 file?

Empirically, there are warnings about how the term gets used. The Wikipedia page for UTF-16 says:

If the BOM is missing, RFC 2781 says that big-endian encoding should be assumed. (In practice, due to Windows using little-endian order by default, many applications similarly assume little-endian encoding by default.)

And unicode.readthedocs.org says:

"UTF-16" and "UTF-32" encoding names are imprecise: depending of the context, format or protocol, it means UTF-16 and UTF-32 with BOM markers, or UTF-16 and UTF-32 in the host endian without BOM. On Windows, "UTF-16" usually means UTF-16-LE.

And further, the Byte-Order-Mark Wikipedia article says:

Clause D98 of conformance (section 3.10) of the Unicode standard states, "The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian."

Whether or not a higher-level protocol is in force is open to interpretation. Files local to a computer for which the native byte ordering is little-endian, for example, might be argued to be encoded as UTF-16LE implicitly. Therefore, the presumption of big-endian is widely ignored.

When those same files are accessible on the Internet, on the other hand, no such presumption can be made. Searching for 16-bit characters in the ASCII range or just the space character (U+0020) is a method of determining the UTF-16 byte order.

So despite the unambiguity of the standard, the context may matter in practice.
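If you do have to fall back on the heuristic that quote mentions, a minimal sketch (assuming the text contains ordinary ASCII spaces) is to look at which side of each 0x20 byte the zero byte falls on:

    #include <stddef.h>

    /* Guess the byte order of BOM-less UTF-16 text by looking for the space
       character U+0020: encoded as 20 00 in LE, 00 20 in BE.
       Returns 'L', 'B', or '?' if no space was found. */
    static char guess_utf16_order(const unsigned char *buf, size_t len) {
        for (size_t i = 0; i + 1 < len; i += 2) {
            if (buf[i] == 0x20 && buf[i + 1] == 0x00) return 'L';
            if (buf[i] == 0x00 && buf[i + 1] == 0x20) return 'B';
        }
        return '?';
    }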

As @rici points out, the standard has been around for a while now. Still, it may pay to double-check files claimed to be "UTF-16". Or even to consider whether you might want to avoid a lot of these issues and embrace UTF-8...

"Should UTF-16 be considered harmful?"

2 votes

No. Don't little endian computers receive packets from the Internet all the time, even though network byte order is big endian?

The encoding depends on how you write the bytes out, not on what your architecture is.
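For example, this is exactly what the POSIX byte-order helpers do for network data (a small sketch; the port number is arbitrary):

    #include <arpa/inet.h>   /* htons / ntohs (POSIX) */
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint16_t port = 8080;           /* arbitrary value in host byte order */
        uint16_t wire = htons(port);    /* host order -> big endian network order */
        uint16_t back = ntohs(wire);    /* big endian -> host order again */

        /* On a little endian host the two representations differ; on a big
           endian host htons()/ntohs() are effectively no-ops. */
        printf("host 0x%04X, network 0x%04X, back 0x%04X\n",
               (unsigned)port, (unsigned)wire, (unsigned)back);
        return 0;
    }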

2 votes

The Unicode encoding schemes are defined in section 3.10 of the Unicode standard. The standard defines seven encoding schemes:

  • 8 bit: UTF-8
  • 16 bit: UTF-16BE, UTF-16LE and UTF-16
  • 32 bit: UTF-32BE, UTF-32LE and UTF-32

In the case of the 16- and 32-bit encodings, the three variants differ in endianness, which may be explicit or indicated by starting the string with a Byte Order Mark (BOM) character, U+FEFF:

  • The LE variant is definitely little-endian; the low-order byte is encoded first. No BOM is permitted, so an initial character U+FEFF is a zero-width no-break space.
  • The BE variant is definitely big-endian; the high-order byte is encoded first. As with the LE variant, no BOM is permitted, so an initial character U+FEFF is a zero-width no-break space.
  • The variant without an endian mark may be big- or little-endian. Normally it will start with a BOM which defines the endianness. If there is no BOM, then big-endian encoding is assumed.

If you are going to use 16- or 32-bit encoding schemes for data serialization, it is generally recommended to use the unmarked variants with an explicit BOM. However, UTF-8 is a much more common data interchange format.

Although no endian marker is needed for UTF-8, it is permitted (but not recommended) to start a UTF-8 encoded string with a BOM; this can be used to differentiate between Unicode encoding schemes. Many Windows programs do this, and a U+FEFF at the beginning of a UTF-8 transmission should probably be treated as a BOM (and thus not as Unicode data).
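For instance (a rough sketch; the three bytes EF BB BF are simply the UTF-8 encoding of U+FEFF), a reader can check for that signature and skip it:

    #include <string.h>

    /* If a buffer starts with the UTF-8 encoding of U+FEFF (EF BB BF),
       treat it as a signature and skip it; otherwise return the buffer as-is. */
    static const unsigned char *skip_utf8_bom(const unsigned char *buf, size_t *len) {
        if (*len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0) {
            *len -= 3;
            return buf + 3;
        }
        return buf;
    }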