35 votes

UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream. UTF-8 always has the same byte order.

If UTF-8 stored all code points in a single byte, it would be obvious why endianness plays no role and thus why a BOM isn't required. But code points 128 and above are stored using 2, 3 or 4 bytes, so their byte order on big-endian machines should presumably differ from their byte order on little-endian machines. How, then, can we claim UTF-8 always has the same byte order?

Thank you

EDIT:

UTF-8 is byte oriented

I understand that if a two-byte UTF-8 character C consists of bytes B1 and B2 (where B1 is the first byte and B2 the last byte), then with UTF-8 those two bytes are always written in the same order (so if this character is written to a file on a little-endian machine LEM, B1 comes first and B2 last; similarly, if C is written to a file on a big-endian machine BEM, B1 still comes first and B2 still last).

But what happens when C is written to file F on LEM, and we then copy F to BEM and try to read it there? Since BEM automatically swaps bytes (B1 is now the last and B2 the first byte), how will an app (running on BEM) reading F know whether F was created on BEM, in which case the order of the two bytes wasn't swapped, or whether F was transferred from LEM, in which case BEM automatically swapped the bytes?

I hope the question makes some sense.

EDIT 2:

In response to your edit: big-endian machines do not swap bytes if you ask them to read a byte at a time.

a) Oh, so even though character C is 2 bytes long, an app (residing on BEM) reading F will read into memory just one byte at a time (so it will first read B1 into memory and only then B2)?

b)

In UTF-8, you decide what to do with a byte based on its high-order bits

Assume file F contains two consecutive characters C and C1 (where C consists of bytes B1 and B2, while C1 consists of bytes B3, B4 and B5). How will an app reading F know which bytes belong together simply by checking each byte's high-order bits (for example, how will it figure out that B1 and B2 taken together should represent a character, and not B1, B2 and B3)? (A sketch of what I think is meant is below, after point c.)

If you believe that you're seeing something different, please edit your question and include

I’m not saying that. I simply didn’t understand what was going on

c) Why aren't UTF-16 and UTF-32 also byte-oriented?
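Regarding b): as I understand the comment, the lead byte's high-order bits announce how many bytes the sequence uses, so a reader can group B1 B2 and B3 B4 B5 without ambiguity. A minimal sketch of that idea in C (the helper name is just illustrative):

    #include <stdio.h>

    /* Length of the UTF-8 sequence that starts with lead byte b,
       or 0 if b is a continuation byte (10xxxxxx) or invalid. */
    static int utf8_sequence_length(unsigned char b)
    {
        if ((b & 0x80) == 0x00) return 1;   /* 0xxxxxxx: 1-byte (ASCII)  */
        if ((b & 0xE0) == 0xC0) return 2;   /* 110xxxxx: 2-byte sequence */
        if ((b & 0xF0) == 0xE0) return 3;   /* 1110xxxx: 3-byte sequence */
        if ((b & 0xF8) == 0xF0) return 4;   /* 11110xxx: 4-byte sequence */
        return 0;                           /* 10xxxxxx continuation     */
    }

    int main(void)
    {
        /* "é" (C3 A9, 2 bytes) followed by "€" (E2 82 AC, 3 bytes):
           the same shape as C = B1 B2 and C1 = B3 B4 B5 above. */
        unsigned char buf[] = { 0xC3, 0xA9, 0xE2, 0x82, 0xAC };
        size_t i = 0;
        while (i < sizeof buf) {
            int len = utf8_sequence_length(buf[i]);
            printf("character starts at byte %zu and uses %d byte(s)\n", i, len);
            i += (size_t)len;           /* skip over the continuation bytes */
        }
        return 0;
    }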

"Byte oriented" means that you read a byte at a time, and decide what to do based on that byte. In UTF-8, you decide what to do with a byte based on its high-order bits. In UTF-16 and UTF-32, by comparison you deal with multiple bytes at a time, and have to organize them into words.Anon
In response to your edit: big-endian machines do not swap bytes if you ask them to read a byte at a time. If you believe that you're seeing something different, please edit your question and include (1) the source and destination machines and operating systems, (2) the exact steps that you're taking to copy the file (copy-paste from your terminal, do not paraphrase), and (3) proof that the file has been changed (for example, by showing byte-level output with od). Oh, and please use some highlight other than code. – Anon
Uh, for some reason I've only now noticed your first comment. Anyway, I will edit my question. – user437291
From the UTF-8 FAQ (unicode.org/faq/utf_bom.html): Q: What is the definition of UTF-8? A: UTF-8 is the byte-oriented encoding form of Unicode. (Links to further details follow.) – mlvljr
@mlvljr And with that comment, you mean..? – Koray Tugay

2 Answers

39 votes

The byte order is different on big endian vs little endian machines for words/integers larger than a byte.

E.g., on a big-endian machine a 2-byte short integer stores the 8 most significant bits in the first byte and the 8 least significant bits in the second byte. On a little-endian machine the 8 most significant bits are in the second byte and the 8 least significant bits in the first byte.

So, if you write the memory content of such a short int directly to a file/network, the byte ordering within the short int will be different depending on the endianness.
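A small sketch (purely illustrative) that dumps the in-memory bytes of such a 2-byte integer shows the difference:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        unsigned short value = 0x1234;        /* a 2-byte integer          */
        unsigned char bytes[sizeof value];

        memcpy(bytes, &value, sizeof value);  /* copy its raw memory image */

        /* A big-endian machine prints "12 34",
           a little-endian machine prints "34 12". */
        printf("%02X %02X\n", bytes[0], bytes[1]);
        return 0;
    }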

UTF-8 is byte oriented, so there's no issue regarding endianness. The first byte is always the first byte, the second byte is always the second byte, etc., regardless of endianness.
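For example, encoding U+20AC (€) produces the same three bytes on any machine, because the encoder emits individual bytes rather than storing a multi-byte integer. A sketch (not a full encoder, just this one case):

    #include <stdio.h>

    int main(void)
    {
        unsigned int cp = 0x20AC;             /* code point U+20AC (€)     */
        unsigned char out[3];

        /* 3-byte UTF-8 pattern: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = 0xE0 | (cp >> 12);
        out[1] = 0x80 | ((cp >> 6) & 0x3F);
        out[2] = 0x80 | (cp & 0x3F);

        /* Prints "E2 82 AC" regardless of the machine's endianness:
           the bytes are produced, written and read in a fixed order. */
        printf("%02X %02X %02X\n", out[0], out[1], out[2]);
        return 0;
    }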

17 votes

To answer c): UTF-16 and UTF-32 represent characters as 16-bit or 32-bit words, so they are not byte-oriented.

For UTF-8, the smallest unit is a byte, thus it is byte-oriented. The algorithm reads or writes one byte at a time. A byte is represented the same way on all machines.

For UTF-16, the smallest unit is a 16-bit word, and for UTF-32, the smallest unit is a 32-bit word. The algorithm reads or writes one word at a time (2 bytes, or 4 bytes). The order of the bytes in each word is different on big-endian and little-endian machines.
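For example, the euro sign U+20AC is a single UTF-16 code unit, the 16-bit word 0x20AC; whether that word is serialized as 20 AC or AC 20 depends on the byte order chosen, which is exactly what the BOM (or the UTF-16BE/UTF-16LE labels) disambiguates. A sketch of the two serializations:

    #include <stdio.h>

    int main(void)
    {
        unsigned short unit = 0x20AC;   /* one UTF-16 code unit (€) */

        /* The same 16-bit word serialized in the two possible byte orders. */
        unsigned char be[2] = { unit >> 8, unit & 0xFF };   /* UTF-16BE: 20 AC */
        unsigned char le[2] = { unit & 0xFF, unit >> 8 };   /* UTF-16LE: AC 20 */

        printf("UTF-16BE: %02X %02X\n", be[0], be[1]);
        printf("UTF-16LE: %02X %02X\n", le[0], le[1]);
        return 0;
    }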