The C and C++ languages allow a huge amount of latitude in their implementations. C was created long before UTF-8 became the de facto way to encode text as bytes: different systems had different text encodings.
So the byte values of a string in C and C++ are really up to the compiler: 'A' has whatever value the compiler's chosen encoding assigns to the character A, and that encoding may not agree with UTF-8.
C++ has since added the requirement that compilers support genuinely UTF-8-encoded string literals. The value of u8"A"[0] is fixed by the C++ standard, via the UTF-8 encoding, regardless of the preferred encoding of the platform the compiler targets.
Now, just as most platforms C++ targets use two's complement integers, most compilers use execution character encodings that are largely compatible with UTF-8. So for strings like "hello world", the ordinary literal and u8"hello world" will almost certainly produce identical bytes.
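A minimal sketch of that claim (it assumes a compiler whose execution character set is UTF-8, such as gcc or clang with default settings; under a non-UTF-8 execution character set such as EBCDIC the comparison could fail):

    #include <cstring>
    #include <iostream>

    int main() {
        // With a UTF-8 execution character set, the ordinary literal and the
        // u8 literal encode to the same bytes. (u8"..." yields char8_t from
        // C++20 onward, char before that; memcmp compares raw bytes either way.)
        bool same = std::memcmp("hello world", u8"hello world",
                                sizeof("hello world")) == 0;
        std::cout << (same ? "identical bytes\n" : "different bytes\n");
    }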
For a concrete example, from man gcc:
-fexec-charset=charset
Set the execution character set, used for string and character constants. The default is UTF-8. charset can be any encoding supported by the system's iconv library routine.
-finput-charset=charset
Set the input character set, used for translation from the character set of the input file to the source character set used by GCC. If the locale does not specify, or GCC cannot get this information from the locale, the default is UTF-8. This can be overridden by either the locale or this command line option. Currently the command line option takes precedence if there's a conflict. charset can be any encoding supported by the system's iconv library routine.
These options show that both the execution and the input character set of C/C++ can be changed.
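As a sketch of what -fexec-charset changes (the file name and the IBM1047 charset name are illustrative; whether that EBCDIC code page is available depends on your system's iconv):

    // dump_bytes.cpp -- dumps the byte values the execution character set
    // produced for a string literal. Build and run, for example, as:
    //   g++ dump_bytes.cpp && ./a.out                          # default UTF-8
    //   g++ -fexec-charset=IBM1047 dump_bytes.cpp && ./a.out   # EBCDIC, if iconv provides it
    #include <cstdio>

    int main() {
        for (unsigned char c : "hi") {                 // 'h', 'i', '\0' in the chosen charset
            std::printf("%02x ", static_cast<unsigned>(c));  // 68 69 00 under UTF-8, 88 89 00 under IBM1047
        }
        std::printf("\n");
    }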
std::cout << int("A"[0]);
can print 193 (EBCDIC, for example) whilestatic_assert(u8"A"[0] == 0x41, "");
is an assertion that cannot fail. – R. Martinho Fernandes
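As a self-contained sketch of that last point (u8"A"[0] has type char8_t from C++20 onward and char before that; the cast through unsigned char avoids printing a negative number where plain char is signed):

    #include <iostream>

    // Cannot fail on any conforming compiler: the standard fixes u8"A"[0]
    // to the UTF-8 (and ASCII) encoding of A, which is 0x41.
    static_assert(u8"A"[0] == 0x41, "u8 literals are always UTF-8 encoded");

    int main() {
        // This one follows the execution character set instead:
        // 65 under ASCII/UTF-8, 193 under EBCDIC.
        std::cout << int(static_cast<unsigned char>("A"[0])) << '\n';
    }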