15 votes

You can write UTF-8/16/32 string literals in C++11 by prefixing the string literal with u8/u/U respectively. How must the compiler interpret a UTF-8 file that has non-ASCII characters inside of these new types of string literals? I understand the standard does not specify file encodings, and that fact alone would make the interpretation of non-ASCII characters inside source code completely undefined behavior, making the feature just a tad less useful.

I understand you can still escape single Unicode characters with \uNNNN, but that is not very readable for, say, a full Russian or French sentence, which typically contains more than one Unicode character.

What I understand from various sources is that u should be equivalent to L on current Windows implementations, while on e.g. Linux implementations it is U that corresponds to L. With that in mind, I'm also wondering what the required behavior is for the old string literal modifiers...

For the code-sample monkeys:

std::string    a = u8"L'hôtel de ville doit être là-bas. Ça c'est un fait!";
std::u16string b = u"L'hôtel de ville doit être là-bas. Ça c'est un fait!";
std::u32string c = U"L'hôtel de ville doit être là-bas. Ça c'est un fait!";

In an ideal world, all of these strings produce the same content (as in: the same characters after conversion), but my experience with C++ has taught me that this is most definitely implementation-defined, and probably only the first will do what I want.


3 Answers

9 votes

In GCC, use -finput-charset=charset:

Set the input character set, used for translation from the character set of the input file to the source character set used by GCC. If the locale does not specify, or GCC cannot get this information from the locale, the default is UTF-8. This can be overridden by either the locale or this command line option. Currently the command line option takes precedence if there's a conflict. charset can be any encoding supported by the system's "iconv" library routine.

Also check out the options -fexec-charset and -fwide-exec-charset.
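For instance, a minimal sketch of how these flags combine (the file name and exact invocation are just illustrative):

// test.cpp — compiled, for example, with:
//   g++ -std=c++11 -finput-charset=UTF-8 -fexec-charset=UTF-8 test.cpp
// With these settings the execution-time bytes of the literal are UTF-8,
// regardless of how the editor happened to save the file on disk.
#include <cstdio>

int main() {
    std::printf("%zu\n", sizeof("héllo") - 1); // prints 6: 'é' is two bytes in UTF-8
    return 0;
}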

Finally, about string literals:

char     a[] = "Hello";
wchar_t  b[] = L"Hello";
char16_t c[] = u"Hello";
char32_t d[] = U"Hello";

The prefix on the string literal (L, u, U) merely determines the type of the literal.
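A small compile-time illustration of that point (C++11; just a sketch using static_assert):

#include <type_traits>

// Each prefix changes only the literal's element type; it does not change
// how the compiler decodes the source file itself.
static_assert(std::is_same<decltype(u8"x"[0]), const char&>::value,     "u8 yields char");
static_assert(std::is_same<decltype(L"x"[0]),  const wchar_t&>::value,  "L yields wchar_t");
static_assert(std::is_same<decltype(u"x"[0]),  const char16_t&>::value, "u yields char16_t");
static_assert(std::is_same<decltype(U"x"[0]),  const char32_t&>::value, "U yields char32_t");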

7 votes

How must the compiler interpret a UTF-8 file that has non-ASCII characters inside of these new types of string literals. I understand the standard does not specify file encodings, and that fact alone would make the interpretation of non-ASCII characters inside source code completely undefined behavior, making the feature just a tad less useful.

From n3290, 2.2 Phases of translation [lex.phases]

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. [Here's a bit about trigraphs.] Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)

There are a lot of Standard terms being used to describe how an implementation deals with encodings. Here's my attempt at a somewhat simpler, step-by-step description of what happens:

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set [...]

The issue of file encodings is handwaved; the Standard only cares about the basic source character set and leaves room for the implementation to get there.

Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character.

The basic source character set is a simple list of allowed characters. It is not ASCII (see below). Anything not in this list is 'transformed' (conceptually at least) into the \uXXXX form.

So no matter what kind of literal or file encoding is used, the source code is conceptually transformed into the basic source character set plus a bunch of \uXXXX escapes. I say conceptually because what implementations actually do is usually simpler, e.g. because they can deal with Unicode directly. The important part is that what the Standard calls an extended character (i.e. one not from the basic source set) should be indistinguishable in use from its equivalent \uXXXX form. Note that C++03 is available on e.g. EBCDIC platforms, so your reasoning in terms of ASCII is flawed from the get-go.

Finally, the process I described applies to (non-raw) string literals too. That means your code is equivalent to writing:

std::string    a = u8"L'h\u00F4tel de ville doit \u00EAtre l\u00E0-bas. \u00C7a c'est un fait!";
std::u16string b = u"L'h\u00F4tel de ville doit \u00EAtre l\u00E0-bas. \u00C7a c'est un fait!";
std::u32string c = U"L'h\u00F4tel de ville doit \u00EAtre l\u00E0-bas. \u00C7a c'est un fait!";
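A quick way to convince yourself of that equivalence (a sketch that assumes the compiler reads the file as UTF-8, e.g. with GCC's -finput-charset=UTF-8):

#include <cassert>
#include <string>

int main() {
    // The extended character 'ô' written directly and its universal-character-name
    // \u00F4 must produce exactly the same string contents.
    std::string direct  = u8"h\u00F4tel";
    std::string spelled = u8"hôtel";
    assert(direct == spelled);
    return 0;
}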
0 votes

In principle, questions of encoding only matter when you output your strings, i.e. make them visible to humans; that is not a question of how the programming language is defined, since its definition deals only with computation. So when you want to know whether what you see in your editor will be the same as what you see in the output (on the screen, in a PDF, or anywhere else), you should ask which conventions your user-interaction library and your operating system assume. For example, with Qt5 what you see as a user of the application and what you see as its programmer coincide, provided the contents of the old-fashioned narrow string literals you feed into your QStrings are encoded as UTF-8 in your source files (unless you change that setting during the application's execution).
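A minimal sketch of that Qt5 behaviour (assuming the source file is saved as UTF-8 and Qt Core is available; QString::fromUtf8 simply makes the assumed encoding explicit):

#include <QDebug>
#include <QString>

int main() {
    // Qt treats the bytes handed to QString::fromUtf8 as UTF-8, so this
    // displays correctly as long as the bytes in the source file really
    // are UTF-8 and the compiler passes them through unchanged.
    QString greeting = QString::fromUtf8("L'hôtel de ville doit être là-bas.");
    qDebug() << greeting;
    return 0;
}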

As a conclusion, I think Kerrek SB is right, and Damon is wrong: indeed, the methods of specifying a literal in the code ought to specify its type, not the encoding that is used in the source file for filling its contents, as the type of a literal is what concerns computation done to it. Something like u"string" is just an array of “unicode codeunits” (that is, values of type char16_t), whatever the operating system or any other service software later does to them and however their job looks for you or for another user. You just get to the problem of adding another convention for yourselves, that makes a correspondence between the “meaning” of numbers under computation (namely, they present the codes of Unicode), and their representation on your screen as you work in your text editor. How and whether you as a programmer use that “meaning” is another question, and how you could enforce this other correspondence is naturally going to be implementation-defined, because it has nothing to do with coding computation, only with comfortability of a tool's use.