"Supporting" Unicode goes well beyond using wchar_t
or std::wstring
(which are merely "types suitable for some wide-character encoding which might or might not be actually Unicode depending on current locale and platform").
Think things like isalpha()
, tokenizing, coverting to / from different encodings etc., and you get the idea.
Unless you know you can get away with the built-in stuff like wchar_t / std::wstring (and you wouldn't be asking in that case), you are better off using the ICU library, which is the state-of-the-art implementation for Unicode support. (Even the otherwise-recommendable Boost.Locale relies on ICU to provide the actual logic.)
The C way of doing Unicode in ICU is arrays of type UChar[] (UTF-16); the C++ way is the class icu::UnicodeString. I happen to work with a legacy codebase that goes to great lengths to "make do" with UChar[] for claimed performance gains (shared references, memory pooling, copy-on-write etc.), but still fails to outperform icu::UnicodeString, so you might feel safe using the latter even in an embedded environment. They did a good job there.
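To make this concrete, here is a minimal sketch of what icu::UnicodeString gives you out of the box: building a string from UTF-8, iterating over code points, Unicode-aware classification, and converting back to UTF-8. It assumes the ICU headers and the icu-uc library are installed and linked; the sample string is just illustrative and the source file is assumed to be UTF-8-encoded.

```cpp
#include <unicode/unistr.h>   // icu::UnicodeString
#include <unicode/uchar.h>    // u_isalpha()
#include <iostream>
#include <string>

int main() {
    // Build a UnicodeString (internally UTF-16) from UTF-8 input.
    icu::UnicodeString s = icu::UnicodeString::fromUTF8("Grüße, Ω");

    // Iterate over code points (not UTF-16 code units) and classify them.
    // u_isalpha() works on Unicode code points, unlike std::isalpha(),
    // which only understands the current locale's single-byte characters.
    for (int32_t i = 0; i < s.length(); i = s.moveIndex32(i, 1)) {
        UChar32 cp = s.char32At(i);
        std::cout << "U+" << std::hex << cp
                  << (u_isalpha(cp) ? " is alphabetic\n" : " is not alphabetic\n");
    }

    // Convert back to UTF-8 for output or storage.
    std::string utf8;
    s.toUTF8String(utf8);
    std::cout << utf8 << '\n';
    return 0;
}
```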
Post scriptum: Take note that wchar_t is of implementation-defined length; 32-bit on the Unixes I know of, and 16-bit on Windows, which gives additional trouble, since wchar_t is supposed to be "wide", but UTF-16 is still "multibyte" as far as Unicode is concerned. If you can rely on the environment supporting C++11, char16_t or char32_t, respectively, would be the better choices, yet still agnostic of finer points like combining characters.
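To illustrate the "wide but still multibyte" point, here is a small self-contained check (C++11, no external libraries): a single non-BMP code point takes two char16_t code units (a surrogate pair) but only one char32_t code unit, while sizeof(wchar_t) varies by platform.

```cpp
#include <iostream>

int main() {
    // Implementation-defined: typically 4 on Linux/macOS, 2 on Windows.
    std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << '\n';

    // C++11 types with a fixed meaning: UTF-16 and UTF-32 code units.
    char16_t utf16[] = u"\U0001F600";   // a non-BMP code point
    char32_t utf32[] = U"\U0001F600";

    // The same single code point needs two UTF-16 code units (a surrogate
    // pair) but only one UTF-32 code unit -- UTF-16 is still "multibyte".
    std::cout << "UTF-16 code units: " << sizeof(utf16) / sizeof(utf16[0]) - 1 << '\n'; // 2
    std::cout << "UTF-32 code units: " << sizeof(utf32) / sizeof(utf32[0]) - 1 << '\n'; // 1
    return 0;
}
```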
wchar_t and std::wstring aren't needed to support Unicode. I'm sure that using UTF-8 (instead of UTF-16) will, on the other hand, force developers to think of code units much earlier, and not lead them into thinking that a wchar_t is a character or a code point. I'm sure of this because it is very likely that they encounter non-ASCII characters far more often than non-BMP characters. And I'm hoping that using UTF-8 will in turn make developers think even further about the complexity of Unicode. – dalle
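As a small illustration of the code-unit vs. code-point distinction raised in the comment, the sketch below counts code points in a UTF-8 std::string by skipping continuation bytes (those of the form 10xxxxxx). The countCodePoints helper is purely for demonstration, not a library function, and it still says nothing about user-perceived characters (combining marks, grapheme clusters).

```cpp
#include <iostream>
#include <string>

// Count Unicode code points in a UTF-8 string by counting only the bytes
// that start a code point, i.e. everything except continuation bytes.
std::size_t countCodePoints(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80)   // not a continuation byte (10xxxxxx)
            ++n;
    }
    return n;
}

int main() {
    std::string s = "naïve";   // assumes the source file is UTF-8-encoded
    std::cout << s.size() << " code units (bytes), "
              << countCodePoints(s) << " code points\n";   // 6 bytes, 5 code points
    return 0;
}
```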