The C and C++ languages allow a huge amount of latitude in their implementations. C was created long before UTF-8 became the de facto way to encode text as bytes: different systems had different text encodings.
So the byte values of a string in C and C++ are really up to the compiler: 'A' has whatever value the compiler's chosen encoding assigns to the character A, and that encoding may not agree with UTF-8.
C++ has since added the requirement that compilers support genuinely UTF-8-encoded string literals. The value of u8"A"[0] is fixed by the C++ standard, via the UTF-8 encoding, regardless of the preferred encoding of the platform the compiler targets.
Now, just as most platforms C++ targets use two's complement integers, most compilers use execution character encodings that are largely compatible with UTF-8. So for strings like "hello world", the ordinary literal and u8"hello world" will almost certainly produce identical bytes.
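A minimal sketch of that claim (it assumes a compiler whose execution character set is UTF-8, such as gcc or clang with default settings; under a non-UTF-8 execution character set such as EBCDIC the comparison could fail):

    #include <cstring>
    #include <iostream>

    int main() {
        // With a UTF-8 execution character set, the ordinary literal and the
        // u8 literal encode to the same bytes. (u8"..." yields char8_t from
        // C++20 onward, char before that; memcmp compares raw bytes either way.)
        bool same = std::memcmp("hello world", u8"hello world",
                                sizeof("hello world")) == 0;
        std::cout << (same ? "identical bytes\n" : "different bytes\n");
    }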
For a concrete example, from man gcc:
-fexec-charset=charset
Set the execution character set, used for string and character constants. The default is UTF-8. charset can be any encoding supported by the system's iconv library routine.
-finput-charset=charset
Set the input character set, used for translation from the character set of the input file to the source character set used by GCC. If the locale does not specify, or GCC cannot get this information from the locale, the default is UTF-8. This can be overridden by either the locale or this command line option. Currently the command line option takes precedence if there's a conflict. charset can be any encoding supported by the system's iconv library routine.
These options show that both the execution and the input character set of C/C++ can be changed.
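As a sketch of what -fexec-charset changes (the file name and the IBM1047 charset name are illustrative; whether that EBCDIC code page is available depends on your system's iconv):

    // dump_bytes.cpp -- dumps the byte values the execution character set
    // produced for a string literal. Build and run, for example, as:
    //   g++ dump_bytes.cpp && ./a.out                          # default UTF-8
    //   g++ -fexec-charset=IBM1047 dump_bytes.cpp && ./a.out   # EBCDIC, if iconv provides it
    #include <cstdio>

    int main() {
        for (unsigned char c : "hi") {                 // 'h', 'i', '\0' in the chosen charset
            std::printf("%02x ", static_cast<unsigned>(c));  // 68 69 00 under UTF-8, 88 89 00 under IBM1047
        }
        std::printf("\n");
    }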
std::cout << int("A"[0]);
can print 193 (EBCDIC, for example) whilestatic_assert(u8"A"[0] == 0x41, "");
is an assertion that cannot fail. – R. Martinho Fernandes
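As a self-contained sketch of that last point (u8"A"[0] has type char8_t from C++20 onward and char before that; the cast through unsigned char avoids printing a negative number where plain char is signed):

    #include <iostream>

    // Cannot fail on any conforming compiler: the standard fixes u8"A"[0]
    // to the UTF-8 (and ASCII) encoding of A, which is 0x41.
    static_assert(u8"A"[0] == 0x41, "u8 literals are always UTF-8 encoded");

    int main() {
        // This one follows the execution character set instead:
        // 65 under ASCII/UTF-8, 193 under EBCDIC.
        std::cout << int(static_cast<unsigned char>("A"[0])) << '\n';
    }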