1
votes

I need to write an app for an embedded device using C++. I may need to support Unicode too (though I am not an expert on it). I also had a look at Joel Spolsky's article about Unicode: http://www.joelonsoftware.com/articles/Unicode.html

My question is: given the above, what is the way to go with Unicode in such an application in C++? Should I use wchar_t everywhere? Or std::wstring?

What problems might I encounter if I use wchar_t all the time? (This post mentions some problems one might run into with Unicode strings: Switching from std::string to std::wstring for embedded applications? - but I am still confused about what exactly I should do.)

2
@dalle: I consider both the linked question and its "accepted" answer to be severely misguided. None of the problems mentioned are inherent to UTF-16; they are inherent to multibyte encodings and to applications written in ignorance of multibyte implications. Using UTF-8 instead doesn't really solve the problems, and using UTF-32 still doesn't solve the issue of e.g. combining characters. If you want to go beyond ISO-8859, you have to understand Unicode, multibyte, and the limits of wide characters. No way around it. - DevSolar
What do you need to do with your Unicode strings? Once you start looking at individual characters, things get tricky and you'll need a library with robust Unicode support to do all your string manipulation, but if you just need to store (and maybe concatenate) valid Unicode strings, then you should be fairly safe. - jalf
@jalf: "What do you need to do with your Unicode strings?" --> I am not yet sure exactly what I will need to do with them. - pseudonym_127
@DevSolar: I just wanted to point out that wchar_t and std::wstring aren't needed to support Unicode. I'm sure that using UTF-8 (instead of UTF-16) will, on the other hand, force developers to think about code units much earlier, and not lead them into thinking that a wchar_t is a character or a code point. I'm sure of this because they are very likely to encounter non-ASCII characters far more often than non-BMP characters. And I'm hoping that using UTF-8 will in turn make the developer think even further about the complexity of Unicode. - dalle

2 Answers

6
votes

"Supporting" Unicode goes well beyond using wchar_t or std::wstring (which are merely "types suitable for some wide-character encoding which might or might not be actually Unicode depending on current locale and platform").

Think of things like isalpha(), tokenizing, converting to / from different encodings etc., and you get the idea.

Unless you know you can get away with built-in facilities like wchar_t / std::wstring (and you wouldn't be asking if you did), you are better off using the ICU library, which is the state-of-the-art implementation of Unicode support. (Even the otherwise-recommendable Boost.Locale relies on ICU to provide the actual logic.)

The C way of doing Unicode in ICU is arrays of type UChar [] (UTF-16); the C++ way is the class icu::UnicodeString. I happen to work with a legacy codebase that goes to great lengths to "make do" with UChar [] for claims of performance (shared references, memory pooling, copy-on-write etc.), but still fails to outperform icu::UnicodeString, so you might feel safe in using the latter even in an embedded environment. They did a good job there.

Post scriptum: Take note that wchar_t is of implementation-defined length: 32 bits on the Unixes I know of, and 16 bits on Windows. That causes additional trouble, since wchar_t is supposed to be "wide", but UTF-16 is still a variable-length ("multibyte") encoding when it comes to Unicode. If you can rely on the environment supporting C++11, char16_t and char32_t would be better choices, though they still know nothing about finer points like combining characters.

0
votes

You've read Joel's article, but it seems you have not understood it. std::wstring and strings of wchar_t are not Unicode; they are wide-character strings that may contain UCS-2 or UTF-16 Unicode strings, or something else. A std::string may contain plain ASCII, or ANSI strings in some code page, or UTF-8 Unicode strings, or something else.

Both of these occur often: the std::wstring tends to be UTF-16 on Windows, std::string tends to be UTF-8 on POSIX.

DevSolar's advice is sound: have a look at ICU instead. It will save you from an awful lot of headaches and misunderstandings.