16
votes

I've recently tried to get the full picture about what steps it takes to create platform independent C++ applications that support unicode. A thing that is confusing to me is that most howtos and stuff equalize the character encoding (i.e. ANSI or Unicode) and the character type (char or wchar_t). As I've learned so far, these are different things and there may exist a character sequence encodeded in Unicode but represented by std::string as well as a character sequence encoded in ANSI but represented as std::wstring, right?

So the question that comes to my mind is whether the C++ standard gives any guarantee about the encoding of string literals starting with L or does it just say it's of type wchar_t with implementation specific character encoding?

If there is no such guaranty, does that mean I need some sort of external resource system to provide non ASCII string literals for my application in a platform independent way? What is the prefered way for this? Resource system or proper encoding of source files plus proper compiler options?

3
There's no such thing as "encoded in unicode". If by unicode you mean widechar or UTF-16, then I suggest you reconsider the preferred encoding, and use UTF-8 instead. See stackoverflow.com/questions/1049947/…Pavel Radzivilovsky

3 Answers

42
votes

The L symbol in front of a string literal simply means that each character in the string will be stored as a wchar_t. But this doesn't necessarily imply Unicode. For example, you could use a wide character string to encode GB 18030, a character set used in China which is similar to Unicode. The C++03 standard doesn't have anything to say about Unicode, (however C++11 defines Unicode char types and string literals) so it's up to you to properly represent Unicode strings in C++03.

Regarding string literals, Chapter 2 (Lexical Conventions) of the C++ standard mentions a "basic source character set", which is basically equivalent to ASCII. So this essentially guarantees that "abc" will be represented as a 3-byte string (not counting the null), and L"abc" will be represented as a 3 * sizeof(wchar_t)-byte string of wide-characters.

The standard also mentions "universal-character-names" which allow you to refer to non-ASCII characters using the \uXXXX hexadecimal notation. These "universal-character-names" usually map directly to Unicode values, but the standard doesn't guarantee that they have to. However, you can at least guarantee that your string will be represented as a certain sequence of bytes by using universal-character-names. This will guarantee Unicode output provided the runtime environment supports Unicode, has the appropriate fonts installed, etc.

As for string literals in C++03 source files, again there is no guarantee. If you have a Unicode string literal in your code which contains characters outside of the ASCII range, it is up to your compiler to decide how to interpret these characters. If you want to explicitly guarantee that the compiler will "do the right thing", you'd need to use \uXXXX notation in your string literals.

3
votes

The C++03 does not mention unicode (future C++0x does). Currently you have to either use external libraries (ICU, UTF-CPP, etc.) or build your own solution using platform specific code. As others have mentioned, wchar_t encoding (or even size) is not specified. Consequently, string literal encoding is implementation specific. However, you can give unicode codepoints in string literals by using \x \u \U escapes.

Typically unicode apps in Windows use wchar_t (with UTF-16 encoding) as internal character format, because it makes using Windows APIs easier as Windows itself uses UTF-16. Unix/Linux unicode apps in turn usually use char (with UTF-8 encoding) internally. If you want to exchange data between different platforms, UTF-8 is usual choice for data transfer encoding.

2
votes

The standard makes no mention of encoding formats for strings.

Take a look at ICU from IBM (its free). http://site.icu-project.org/