The suitable class for handling Unicode strings in C++ is icu::UnicodeString (check "API References, ICU4C" in the sidebar), at least if you want to really handle Unicode strings (as opposed to just passing them from one point of your application to another).
wchar_t was an early attempt at handling international character sets. It turned out to be a failure because Microsoft's definition of wchar_t as two bytes became insufficient once Unicode was extended beyond code point 0xFFFF. Linux defines wchar_t as four bytes, but the inconsistency makes it (and its derived std::wstring) rather useless for portable programming.
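A quick way to see that inconsistency for yourself is to print sizeof(wchar_t) on both platforms; a minimal sketch:

```cpp
#include <cstdio>

int main()
{
    // Typically prints 2 on Windows (MSVC) and 4 on Linux -- the very
    // inconsistency that makes wchar_t useless for portable code.
    std::printf("sizeof(wchar_t) == %zu\n", sizeof(wchar_t));
    return 0;
}
```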
TCHAR is a Microsoft define that resolves to char by default and to WCHAR if UNICODE is defined, with WCHAR in turn being wchar_t behind a level of indirection... yeah.
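For illustration, a simplified sketch of that mechanism; the real definitions live in the Windows headers (<winnt.h>, <tchar.h>) and involve more macros than shown here:

```cpp
// Simplified sketch of the TCHAR mechanism, not the verbatim Windows headers.
typedef wchar_t WCHAR;       // WCHAR is wchar_t behind one level of indirection

#ifdef UNICODE
    typedef WCHAR TCHAR;     // "wide" build: TCHAR is wchar_t
    #define TEXT(s) L##s     // TEXT("foo") becomes L"foo"
#else
    typedef char  TCHAR;     // default build: TCHAR is plain char
    #define TEXT(s) s
#endif
```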
C++11 brought us char16_t and char32_t as well as the corresponding string classes, but those are still instantiations of basic_string<>, and as such have their shortcomings, e.g. when trying to uppercase / lowercase characters that map to more than one replacement character (e.g. the German ß, which expands to SS in uppercase; the standard library cannot do that).
ICU, on the other hand, goes the full way. For example, it provides normalization and decomposition, which the standard strings do not.
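To make this concrete, here is a minimal ICU4C sketch of both points, the ß to SS case mapping from above and NFC normalization (assuming ICU 59 or newer, where UChar is char16_t; link against the common library, e.g. -licuuc):

```cpp
// Minimal ICU4C sketch: two things the standard string classes cannot do.
#include <unicode/unistr.h>
#include <unicode/locid.h>
#include <unicode/normalizer2.h>
#include <iostream>
#include <string>

int main()
{
    // Locale-aware, one-to-many case mapping: "straße" -> "STRASSE".
    icu::UnicodeString word(u"stra\u00DFe");
    word.toUpper(icu::Locale::getGerman());

    std::string utf8;
    word.toUTF8String(utf8);                 // convert back to UTF-8 for output
    std::cout << utf8 << "\n";               // prints STRASSE

    // Normalization: "é" precomposed vs. "e" + combining accent are different
    // code unit sequences, but identical after NFC normalization.
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2 *nfc = icu::Normalizer2::getNFCInstance(status);
    if (U_FAILURE(status)) return 1;

    icu::UnicodeString precomposed(u"\u00E9");   // U+00E9
    icu::UnicodeString decomposed(u"e\u0301");   // 'e' + U+0301 combining acute

    std::cout << (precomposed == decomposed) << "\n";              // 0
    std::cout << (nfc->normalize(precomposed, status)
                  == nfc->normalize(decomposed, status)) << "\n";  // 1
    return 0;
}
```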
\uxxxx and \UXXXXXXXX are Unicode character escapes. The xxxx is a 16-bit hexadecimal number representing a UCS-2 code point, which within the Basic Multilingual Plane is equivalent to a single UTF-16 code unit. The XXXXXXXX is a 32-bit hex number representing a UTF-32 code point, which may lie in any plane. How those character escapes are handled depends on the context in which they appear (narrow / wide string, for example), making them somewhat less than perfect.
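A small illustration of that context dependence; the narrow-string result depends on the compiler's execution character set, so the byte count in the comment is what you would typically see with a UTF-8 execution charset:

```cpp
#include <cstdio>

int main()
{
    const char    narrow[] =  "\u00DF";  // ß in the execution charset: 2 bytes
                                         // as UTF-8, 1 byte as Latin-1, or a
                                         // compile error if unrepresentable
    const wchar_t wide[]   = L"\u00DF";  // one wchar_t element (plus terminator)

    std::printf("narrow: %zu bytes, wide: %zu element(s)\n",
                sizeof(narrow) - 1, sizeof(wide) / sizeof(wchar_t) - 1);
    return 0;
}
```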
C++11 introduced "proper" Unicode literals:
u8"..."
is always a const char[]
in UTF-8 encoding.
u"..."
is always a const uchar16_t[]
in UTF-16 encoding.
U"..."
is always a const uchar32_t[]
in UTF-32 encoding.
If you use \uxxxx or \UXXXXXXXX within one of those three, the escape will always be expanded to the proper code unit sequence for that encoding.
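For example (compiled as C++11/14/17; in C++20 the element type of u8 literals changes to char8_t), a code point outside the BMP expands to the expected number of code units in each encoding:

```cpp
// U+1F600 (an emoji, outside the BMP) as an example non-BMP code point.
constexpr char     utf8[]  = u8"\U0001F600";  // 4 UTF-8 code units
constexpr char16_t utf16[] =  u"\U0001F600";  // 2 UTF-16 code units (surrogate pair)
constexpr char32_t utf32[] =  U"\U0001F600";  // 1 UTF-32 code unit

static_assert(sizeof(utf8)  / sizeof(char)     - 1 == 4, "4 UTF-8 bytes");
static_assert(sizeof(utf16) / sizeof(char16_t) - 1 == 2, "surrogate pair");
static_assert(sizeof(utf32) / sizeof(char32_t) - 1 == 1, "single code unit");
```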
Note that storing UTF-8 in a std::string is possible, but hazardous. You need to be aware of many things: .length() gives you the number of bytes, not the number of characters in your string. .substr() can leave you with partial, invalid sequences. .find_first_of() will not work as expected. And so on.
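A small demonstration of the first two pitfalls, with ß spelled out as raw UTF-8 bytes so it does not depend on the source file's encoding:

```cpp
#include <iostream>
#include <string>

int main()
{
    std::string s = "stra\xC3\x9F" "e";   // "straße", ß as its two UTF-8 bytes

    std::cout << s.length() << "\n";      // 7 bytes, although "straße" has 6 characters
    std::cout << s.substr(0, 5) << "\n";  // cuts the ß in half: invalid UTF-8 output
    return 0;
}
```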
That being said, in my opinion UTF-8 is the only sane encoding choice for any stored text. There are cases to be made for handling texts as UTF-16 in-memory (the way ICU does), but on disk, don't accept anything but UTF-8. It's space-efficient, endianness-independent, and allows for semi-sane handling even by software that is blissfully unaware of Unicode matters (see caveats above).