2
votes

I was playing around with Unicode characters (without using wchar_t support) just for fun. I'm only using the regular char data type. I noticed that when I print them in hex, they show up as a full 4 bytes instead of just one byte.

For example, consider this C file:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char *s = (char *) malloc(100);
    fgets(s, 100, stdin);
    while (s && *s != '\0') {
            printf("%x\n", *s);
            s++;
    }
    return 0;
}

After compiling with gcc and entering the cent symbol (UTF-8 bytes: c2 a2) as input, I get the following output:

$ ./a.out
¢
ffffffc2: ?
ffffffa2: ?
a: 

So instead of just printing c2 and a2, I got the whole 4 bytes, as if it were an int type.

Does this mean char is not really 1 byte long, and ASCII just made it look that way?

3
sizeof(char) == 1, that's guaranteed by the standard. Also: don't cast the return value of malloc. - user1203803
Your chars are getting promoted to int when you pass them to printf(). - Mysticial

3 Answers

5
votes

Maybe the reason why the upper three bytes become 0xFFFFFF needs a bit more explanation?

The upper three bytes of the value printed for *s have a value of 0xFF due to sign extension.

The char value passed to printf is extended to an int before the call to printf.

This is due to C's default behaviour.

When neither signed nor unsigned is specified, it is up to the implementation whether plain char behaves as signed char or unsigned char. A given compiler is consistently one or the other unless that is explicitly changed with a command-line option or pragma. In this case we can see that it is signed char.

In the absence of more information (prototypes or casts), C passes:

  • int, so char, short, unsigned char and unsigned short are converted to int. It never passes a char, unsigned char or signed char as a single byte; it always passes an int.
  • unsigned int is the same size as int, so the value is passed without change

The compiler needs to decide how to convert the smaller value to an int.

  • signed values: the upper bytes of the int are sign extended from the smaller value, which effectively copies the top (sign) bit upwards to fill the int. If the top bit of the smaller signed value is 0, the upper bytes are filled with 0 bits. If the top bit of the smaller signed value is 1, the upper bytes are filled with 1 bits. Hence printf("%x ",*s) prints ffffffc2
  • unsigned values are not sign extended, the upper bytes of the int are 'zero padded'

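These default promotions are the reason C can call a function without a prototype (though the compiler will usually warn about that, and C99 and later no longer allow implicit function declarations).

So with a compiler that still accepts it, you can write this and expect it to run (though I would hope your compiler issues warnings):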

/* Notice the include is 'removed' so the C compiler does default behaviour */
/* #include <stdio.h> */

int main (int argc, const char * argv[]) {
    signed char schar[] = "\x70\x80";
    unsigned char uchar[] = "\x70\x80";

    printf("schar[0]=%x schar[1]=%x uchar[0]=%x uchar[1]=%x\n", 
            schar[0],   schar[1],   uchar[0],   uchar[1]);
    return 0;
}

That prints:

schar[0]=70 schar[1]=ffffff80 uchar[0]=70 uchar[1]=80

The char value is interpreted by my (Mac's gcc) compiler as signed char, so the compiler generates code to sign extend the char to an int before the printf call.

Where the signed char value has its top (sign) bit set (\x80), the conversion to int sign extends the char value. The sign extension fills in the upper bytes (in this case 3 more bytes to make a 4-byte int) with 1 bits, which printf prints as ffffff80.

Where the signed char value has its top (sign) bit clear (\x70), the conversion to int still sign extends the char value. In this case the sign bit is 0, so the sign extension fills in the upper bytes with 0 bits, which printf prints as 70.

My example also shows the unsigned char case. In those two cases the value is not sign extended, because it is unsigned. Instead it is extended to int with zero padding. It might look as if printf is printing only one byte, because the upper three bytes of the value are 0. But it is printing the entire int; the values just happen to be 0x00000070 and 0x00000080 because the unsigned char values were converted to int without sign extension.

You can force printf to only print the low byte of the int, by using suitable formatting (%hhx), so this correctly prints only the value in the original char:

/* Notice the include is 'removed' so the C compiler does default behaviour */
/* #include <stdio.h> */

int main (int argc, const char * argv[]) {
    char schar[] = "\x70\x80";
    unsigned char uchar[] = "\x70\x80";

    printf("schar[0]=%hhx schar[1]=%hhx uchar[0]=%hhx uchar[1]=%hhx\n", 
           schar[0],   schar[1],   uchar[0],   uchar[1]);
    return 0;
}

This prints:

schar[0]=70 schar[1]=80 uchar[0]=70 uchar[1]=80

because %hhx tells printf to treat the int as an unsigned char. This does not change the fact that the char was sign extended to an int before printf was called. It is only a way to tell printf how to interpret the contents of the int.

In a way, for signed char *schar, the meaning of %hhx looks slightly misleading, but the '%x' format interprets the int as unsigned anyway, and (with my printf) there is no format to print hex for signed values (IMHO it would be confusing).

Sadly, ISO/ANSI/... don't freely publish our programming language standards, so I can't point to the specification, but searching the web might turn up working drafts. I haven't tried to find them. I would recommend "C: A Reference Manual" by Samuel P. Harbison and Guy L. Steele as a cheaper alternative to the ISO document.

HTH

4
votes

No. printf is a variadic function, and any argument narrower than int that is passed in the variable part of the call is promoted to int. In this case the char was negative, so it was sign extended.

1
votes

%x tells printf to interpret the argument as an unsigned int. The char has already been promoted to an int by the time it is passed, sign extending as necessary, and printf then prints the resulting value as if it were unsigned.