3
votes

I was wondering why both utf-16le and utf-16be exist. Is it considered "inefficient" for a big-endian environment to process little-endian data?

Currently, this is what I use to store a 2-byte variable locally:

  unsigned char octets[2];
  short int shortint = 12345; /* (assuming short int = 2 bytes) */
  octets[0] = shortint & 255;
  octets[1] = (shortint >> 8) & 255;

I know that as long as I store and read with a fixed endianness locally, there is no endianness risk. I was wondering whether this is considered "inefficient"? What would be the most "efficient" way to store a 2-byte variable? (While restricting the data to the environment's endianness, local use only.)

Thanks, Doori Bar


1 Answer

2
votes

Having both encodings allows code to write large amounts of Unicode data to a file without conversion: each platform can write in its native byte order. During loading, you must always check the endianness. If you're lucky, the file matches your CPU and you need no conversion. So in about 66% of the cases you need no conversion and only in 33% you must convert.

In memory, you can then access the data using the native data types of your CPU, which allows for efficient processing.

That way, everyone can be as happy as possible.

So in your case, you need to check the encoding when loading the data, but in RAM you can use an array of short int to process it.
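For the loading step, a minimal sketch of such a check could look like this (the function name and the no-BOM fallback are just illustrative; it assumes the file starts with a BOM):

#include <stdio.h>

/* Sketch: detect the byte order of a UTF-16 file from its BOM.
   Returns 1 for little-endian, 0 for big-endian, -1 if no BOM was found. */
int utf16_is_little_endian(FILE *f)
{
    unsigned char bom[2];
    if (fread(bom, 1, 2, f) != 2)
        return -1;
    if (bom[0] == 0xFF && bom[1] == 0xFE)
        return 1;   /* UTF-16LE */
    if (bom[0] == 0xFE && bom[1] == 0xFF)
        return 0;   /* UTF-16BE */
    return -1;      /* no BOM: pick a default or use a heuristic */
}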

[EDIT] The fastest way to convert a 16-bit value to 2 octets is:

char octet[2];
short * ptr = (short*)&octet[0];  /* reinterpret the buffer as a 16-bit value */
*ptr = 12345;

Now you don't know whether octet[0] holds the low or the high 8 bits. To find that out, write a known value and then examine it.
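For example (just a sketch of that check):

#include <stdio.h>

int main(void)
{
    short test = 0x0102;
    unsigned char * p = (unsigned char*)&test;
    if (p[0] == 0x02)
        printf("octet[0] holds the low 8 bits (little-endian)\n");
    else
        printf("octet[0] holds the high 8 bits (big-endian)\n");
    return 0;
}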

This will give you one of the encodings; the native one of your CPU.

If you need the other encoding, you can either swap the octets as you write them to a file (i.e. write them octet[1], octet[0]) or swap them in your code before writing.
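Swapping a single 16-bit value in code could look roughly like this (a sketch; swap16 is just an illustrative name):

unsigned short swap16(unsigned short v)
{
    /* exchange the low and high byte */
    return (unsigned short)((v << 8) | (v >> 8));
}

You would then write swap16(value) to the file instead of value.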

If you have several octets, you can use 32-bit integers to swap two 16-bit values at once:

char octet[4];
short * ptr = (short*)&octet[0];
*ptr++ = 12345;
*ptr++ = 23456;

unsigned int * ptr32 = (unsigned int*)&octet[0];
/* swap the bytes of both 16-bit values in one operation */
unsigned int val = ((*ptr32 << 8) & 0xff00ff00) | ((*ptr32 >> 8) & 0x00ff00ff);
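If you want the swapped values back in the buffer before writing it out, you can simply store the result again, e.g. *ptr32 = val;.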