2
votes

I was reading about little and big endian representations from this site http://www.geeksforgeeks.org/little-and-big-endian-mystery/.

Suppose we have a number 0x01234567, then in little endian it is stored as (67)(45)(23)(01) and in Big endian it is stored as (01)(23)(45)(67).

char *s = "ABCDEF";
int *p = (int *)s;
printf("%d",*(p+1)); // prints 17475 (value of DC)

After seeing the value printed by the code above, it seems that the string is stored as (BA)(DC)(FE).

Why is it not stored like (EF)(CD)(AB), from LSB to MSB, as in the first example? I thought that endianness means the ordering of bytes within a multi-byte value. So the ordering should be with respect to the "whole 2 bytes", as in the 2nd case, and not within those 2 bytes, right?

4
"After seeing the printed value here in the above code," What printed value? On my little-endian machine, the given code prints 17989 (hex: 0x4645), which seems perfectly normal to me. - jwodder
This looks like UB to me. s points to 7 bytes (6 characters plus the NUL terminator). You set p equal to s, but when you print you use p+1. Assuming you have 4-byte ints, this points at 'E'. The next byte is 'F', the one after that is the NUL terminator, and the last one is beyond your allocated space. But that aside, it looks good to me: my little-endian printout is 0x25004645. 0x45 is 'E', 0x46 is 'F', 0x00 is the terminator, and 0x25 is no man's land. - yano
@yano, my compiler considers CD as "DC". I have 2-byte ints. See my edit. - Sagar P
OK, with 2-byte ints (shorts on my machine), forget the UB; it still looks good. My printout now is 0x4443, where 0x43 is 'C' and 0x44 is 'D'. I suspect you're confusing ASCII characters in a string with hex values? Each character in your string corresponds to a byte, which can be represented with a 2-digit hex number. Use the printf format specifier "%x" to print in hex. 17475 is indeed 0x4443, which is what is expected from a little-endian machine. - yano
"ABCDEF" and 0xABCDEF are very different... - Breaking not so bad

4 Answers

11
votes

Working with 2 byte ints, this is what you have in memory

memAddr  |  0  |  1  |  2  |  3  |  4  |  5  |  6   |
data     | 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | '\0' |
            ^ s points here
                        ^ p+1 points here

Now, it looks like you're using ASCII encoding, so this is what you really have in memory

memAddr  |  0   |  1   |  2   |  3   |  4   |  5   |  6   |
data     | 0x41 | 0x42 | 0x43 | 0x44 | 0x45 | 0x46 | 0x00 |
            ^ s points here
                          ^ p+1 points here

On a little-endian machine, the least significant byte of a multi-byte type comes first. There's no concept of endianness for a single-byte char, and an ASCII string is just a sequence of chars, so it has no endianness either. Your ints are 2 bytes. For an int starting at memory location 2, that byte is the least significant, and the one at address 3 is the most significant. This means the number here, read the way people generally read numbers, is 0x4443 (17475 in base 10, "DC" as an ASCII string), since 0x44 in memory location 3 is more significant than 0x43 in memory location 2. For big endian, of course, this would be reversed, and the number would be 0x4344 (17220 in base 10, "CD" as an ASCII string).
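A minimal sketch of the same reasoning in C, assuming an ASCII character set. The helper names (read_native16, read_le16, read_be16) are made up for illustration. memcpy is used instead of the pointer cast from the question, because casting a char pointer to an int pointer can run into alignment and aliasing problems.

```c
#include <stdint.h>
#include <string.h>

/* Read two bytes starting at s + offset as a 16-bit integer in the
   machine's native byte order. */
uint16_t read_native16(const char *s, size_t offset)
{
    uint16_t v;
    memcpy(&v, s + offset, sizeof v);
    return v;
}

/* Little-endian reading: the byte at the lower address is least significant. */
uint16_t read_le16(const char *s, size_t offset)
{
    const unsigned char *b = (const unsigned char *)s + offset;
    return (uint16_t)(b[0] | (b[1] << 8));
}

/* Big-endian reading: the byte at the lower address is most significant. */
uint16_t read_be16(const char *s, size_t offset)
{
    const unsigned char *b = (const unsigned char *)s + offset;
    return (uint16_t)((b[0] << 8) | b[1]);
}
```

With these, read_le16("ABCDEF", 2) gives 0x4443 (17475) and read_be16("ABCDEF", 2) gives 0x4344 (17220). read_native16 returns whichever of the two matches the machine it runs on.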

EDIT:

Addressing your comment... A C string is a NUL-terminated array of chars, that's absolutely correct. Endianness only applies to the primitive types: short, int, long, long long, etc. ("primitive types" may be incorrect nomenclature, someone who knows can correct me). An array is simply a section of contiguous memory where one or more elements of the same type sit directly next to each other, stored sequentially. There is no concept of endianness for the entire array; however, endianness does apply to the primitive types of the individual elements of the array. Let's say you have the following, assuming 2-byte ints:

int array[3];  // with 2 byte ints, this occupies 6 contiguous bytes in memory
array[0] = 0x1234;
array[1] = 0x5678;
array[2] = 0x9abc;

This is what memory looks like, on either a big- or little-endian machine:

memAddr   |    0-1   |    2-3   |    4-5   |
data      | array[0] | array[1] | array[2] |

Notice there is no concept of endianness for the order of the array elements. This is true no matter what the elements are. The elements could be primitive types, structs, anything. The first element in the array is always at array[0].

But now, if we look at what's actually in the array, this is where endianness does come into play. For a little-endian machine, memory will look like this:

memAddr   |  0   |  1   |  2   |  3   |  4   |  5   |
data      | 0x34 | 0x12 | 0x78 | 0x56 | 0xbc | 0x9a |
             ^______^      ^______^      ^______^
             array[0]      array[1]      array[2]

The least significant bytes are first. A big endian machine would look like this:

memAddr   |  0   |  1   |  2   |  3   |  4   |  5   |
data      | 0x12 | 0x34 | 0x56 | 0x78 | 0x9a | 0xbc |
             ^______^      ^______^      ^______^
             array[0]      array[1]      array[2]

Notice the contents of each element of the array are subject to endianness (because it's an array of primitive types; if it were an array of structs, the order of the struct members wouldn't be subject to some kind of endianness reversal, since endianness only applies to primitives). However, whether on the big- or little-endian machine, the array elements are still in the same order.
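A small sketch to check this on a real machine. It uses uint16_t to stand in for the 2-byte int assumed above, and the helper names (dump_array_bytes, bytes_hold_value) are hypothetical. The point it illustrates: element i always occupies bytes 2*i and 2*i+1, and only the order of the two bytes inside each element depends on the machine.

```c
#include <stdint.h>
#include <string.h>

/* Copy the raw bytes of a 3-element array of 16-bit integers into out[6],
   in increasing address order. */
void dump_array_bytes(const uint16_t arr[3], unsigned char out[6])
{
    memcpy(out, arr, 3 * sizeof arr[0]);
}

/* Return 1 if the two bytes at `two` hold `value` in either byte order. */
int bytes_hold_value(const unsigned char *two, uint16_t value)
{
    unsigned char lo = (unsigned char)(value & 0xFF);
    unsigned char hi = (unsigned char)(value >> 8);
    return (two[0] == lo && two[1] == hi)    /* little-endian layout */
        || (two[0] == hi && two[1] == lo);   /* big-endian layout */
}
```

Filling an array with 0x1234, 0x5678, 0x9abc and dumping it, bytes_hold_value(out + 2, 0x5678) holds on both kinds of machine, while out[2] itself is 0x78 on a little-endian machine and 0x56 on a big-endian one.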

Getting back to your string, a string is simply a NUL terminated array of characters. chars are single bytes, so there's only one way to order them. Consider the code:

char word[] = "hey";

This is what you have in memory:

memAddr   |    0    |    1    |    2    |    3    |
data      | word[0] | word[1] | word[2] | word[3] |
                  equals NUL terminator '\0' ^

In this case, though, each element of the word array is a single byte, and there's only one way to order a single item, so whether on a little- or big-endian machine, this is what you'll have in memory:

memAddr   |  0   |  1   |  2   |  3   |
data      | 0x68 | 0x65 | 0x79 | 0x00 |

Endianness only applies to multi-byte primitive types. I highly recommend poking around in a debugger to see this in live action. All the popular IDEs have memory view windows, or with gdb you can print out memory. In gdb you can print memory as bytes, halfwords (2 bytes), words (4 bytes), giant words (8 bytes), etc. On a little-endian machine, if you print out your string as bytes, you'll see the letters in order. Print it out as halfwords and you'll see every 2 letters "reversed"; print it out as words and you'll see every 4 letters "reversed", etc. On a big-endian machine, it would all print out in the same "readable" order.
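If no debugger is at hand, the grouping effect can be imitated in plain C. This is only a simulation of what a little-endian memory view shows, not gdb itself, and view_as_le_units is a made-up name. Within each unit the byte at the highest address is the most significant, so it is written out first.

```c
#include <stddef.h>
#include <string.h>

/* Rearrange `len` bytes of `mem` the way a little-endian memory view would
   display them when grouped into units of `unit` bytes.
   `unit` must be nonzero; a trailing partial unit is dropped.
   `out` must have room for len + 1 chars. */
void view_as_le_units(const char *mem, size_t len, size_t unit, char *out)
{
    for (size_t base = 0; base + unit <= len; base += unit) {
        for (size_t i = 0; i < unit; i++) {
            /* Highest address in the unit goes first. */
            out[base + i] = mem[base + unit - 1 - i];
        }
    }
    out[len - len % unit] = '\0';
}
```

For the 8 bytes "ABCDEFGH", a unit of 1 leaves the text as "ABCDEFGH", a unit of 2 gives "BADCFEHG", and a unit of 4 gives "DCBAHGFE".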

3
votes

It seems there is a little confusion between the string of characters

1)  "ABCDEF"

and the number 11,259,375 which expressed in hexadecimal is

2)  0xABCDEF

In the first case, each letter takes a whole byte.
In the second case, we have six hexadecimal digits; one hex digit takes 4 bits, so two digits fit in a byte.

Endianness-wise:
1) The characters 'A', then 'B', etc. are written sequentially in memory. 'A' is 0x41, 'B' is 0x42...
2) This is a multi-byte integer whose byte order depends on the architecture. Say the number is 4 bytes: a big-endian architecture would store in memory (hex) 00 AB CD EF; a little-endian one would store the bytes in this order: EF CD AB 00

Big endian

A  B  C  D  E  F
41 42 43 44 45 46   [ text ]
00 AB CD EF         [ integer ] 
----(addresses)---->

Little endian

----(addresses)---->
A  B  C  D  E  F
41 42 43 44 45 46   [ text ]
EF CD AB 00         [ integer ]
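The integer rows of the two diagrams can be reproduced with shifts, independently of the machine the code runs on. A minimal sketch; store_be32 and store_le32 are illustrative names, not library functions.

```c
#include <stdint.h>

/* Write v into out[0..3] most significant byte first (big-endian order). */
void store_be32(uint32_t v, unsigned char out[4])
{
    out[0] = (unsigned char)((v >> 24) & 0xFF);
    out[1] = (unsigned char)((v >> 16) & 0xFF);
    out[2] = (unsigned char)((v >> 8) & 0xFF);
    out[3] = (unsigned char)(v & 0xFF);
}

/* Write v into out[0..3] least significant byte first (little-endian order). */
void store_le32(uint32_t v, unsigned char out[4])
{
    out[0] = (unsigned char)(v & 0xFF);
    out[1] = (unsigned char)((v >> 8) & 0xFF);
    out[2] = (unsigned char)((v >> 16) & 0xFF);
    out[3] = (unsigned char)((v >> 24) & 0xFF);
}
```

For 0x00ABCDEF, store_be32 produces 00 AB CD EF and store_le32 produces EF CD AB 00, while the text "ABCDEF" stays 41 42 43 44 45 46 in both cases.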

In your case

char *s= "ABCDEF";     // text
int *p = (int *)s;     //
printf("%d",*(p+1));   // *(p+1) is p[1]

Since your implementation has sizeof(int) == 2, the number printed (17475) is 0x4443, or 'DC' as characters. Having 0x44 ('D') as the MSB and 0x43 ('C') as the LSB shows that your architecture is little-endian.

Writing a string of chars (sequentially) in memory and reading a couple of them as an int gives a number that depends on endianness. Yes, endianness matters in this case.
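The same observation can be turned into a runtime check. A sketch under the assumption of an ASCII character set, with uint16_t standing in for the 2-byte int; is_little_endian and second_halfword are hypothetical helper names.

```c
#include <stdint.h>
#include <string.h>

/* Return 1 if the lowest-addressed byte of a multi-byte integer is its
   least significant byte, 0 otherwise. */
int is_little_endian(void)
{
    const uint16_t probe = 0x0001;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 0x01;
}

/* Read the bytes at s[2] and s[3] as one native 16-bit integer, which is
   what *(p+1) does in the question when int is 2 bytes wide. */
uint16_t second_halfword(const char *s)
{
    uint16_t v;
    memcpy(&v, s + 2, sizeof v);
    return v;
}
```

On a little-endian machine second_halfword("ABCDEF") is 0x4443; on a big-endian one it is 0x4344.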

1
votes

Endianness doesn't come into play when storing single bytes, as in the char const array that s points to. If you examined the memory at s you would find the bytes 'A', 'B', 'C', ... in order. When the first four of them are interpreted as a 4-byte int on a little-endian system, however, they read as "DCBA" (0x44434241).

Remember that each char is already a byte. If you had char const *s = "\xfe\xdc\xab\x09"; and you did a printf("%d", *(int const *)s); on a little-endian system with 4-byte ints, then it would print whatever 0x09abdcfe comes out as in decimal.
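It is easy to mix up the ten-character text "0xfedcab09" with the four raw bytes FE DC AB 09. The sketch below reads the first four bytes of a buffer as a little-endian 32-bit value using shifts, so the result does not depend on the host. first_four_as_le32 is a made-up name, and ASCII is assumed for the text case.

```c
#include <stdint.h>
#include <string.h>

/* Interpret the first four bytes at s as a little-endian 32-bit integer:
   the byte at the lowest address is the least significant. */
uint32_t first_four_as_le32(const char *s)
{
    const unsigned char *b = (const unsigned char *)s;
    return (uint32_t)b[0]
         | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[3] << 24);
}
```

Given the raw bytes "\xfe\xdc\xab\x09" this yields 0x09abdcfe. Given the text "0xfedcab09" it yields 0x65667830, because the first four bytes are the characters '0', 'x', 'f', 'e' (0x30, 0x78, 0x66, 0x65).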

0
votes

The confusion presented here is due to notation.

The string “ABCDEF” can be interpreted (and stored) multiple ways.

In a character string each letter takes an entire byte (char).

char s[] = { 'A', 'B', 'C', 'D', 'E', 'F', 0 };

However, the hexadecimal representation of the number ABCDEF is different: each digit ('0'..'9' and 'A'..'F') represents only four bits, or half a byte. Thus, the number 0xABCDEF is the sequence of bytes

0xAB 0xCD 0xEF

This is where endianness becomes an issue:

  • Little Endian: least significant byte first
    unsigned char x[] = { 0xEF, 0xCD, 0xAB };
  • Big Endian: most significant byte first
    unsigned char x[] = { 0xAB, 0xCD, 0xEF };
  • Mixed Endian: <other random orderings>
    unsigned char x[] = { 0xEF, 0x00, 0xCD, 0xAB };