Unicode strings in embedded software

Question

I need to write an app for an embedded device in C++. I may need to support Unicode as well (though I am not an expert on it). I have also read Joel Spolsky's article about Unicode: http://www.joelonsoftware.com/articles/Unicode.html

My question is: given the above, what is the way to go with Unicode in such an application in C++? Should I use wchar_t everywhere? Or std::wstring?

What problems might I encounter when using wchar_t all the time? (This post mentions some problems one might encounter with Unicode strings: Switching from std::string to std::wstring for embedded applications? But I am still confused and don't know exactly what to do.)

@dalle: I consider both the linked question and its "accepted" answer to be severely misguided. None of the problems mentioned are inherent to UTF-16, they are inherent to multibyte encodings and applications written in ignorance of multibyte implications. Using UTF-8 instead doesn't really solve the problems, and using UTF-32 still doesn't solve the issue of e.g. combining characters. You want to go beyond ISO-8859, you have to understand Unicode, multibyte, and the limits of wide characters. No way around it. — DevSolar
What do you need to do with your Unicode strings? Once you start looking at individual characters, things get tricky and you'll need a library with robust Unicode support to do all your string manipulation, but if you just need to store (and maybe concatenate) valid Unicode strings, then you should be fairly safe. — jalf
@jalf: "What do you need to do with your Unicode strings?" --> I am not sure yet exactly what I will need to do with them. — pseudonym_127
@DevSolar: I just wanted to point out that wchar_t and std::wstring aren't needed to support Unicode. I'm sure that using UTF-8 (instead of UTF-16) will, on the other hand, force developers to think about code units much earlier, and not lead them into thinking that a wchar_t is a character or a code point. I'm sure of this because they are far more likely to encounter non-ASCII characters than non-BMP characters. And I'm hoping that using UTF-8 will in turn make developers think further about the complexity of Unicode. — dalle

DevSolar · Accepted Answer · 2013-05-16T09:02:26

"Supporting" Unicode goes well beyond using wchar_t or std::wstring (which are merely "types suitable for some wide-character encoding which might or might not be actually Unicode depending on current locale and platform").

Think of things like isalpha(), tokenizing, converting to / from different encodings, etc., and you get the idea.
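To illustrate why byte-oriented routines fall short, here is a minimal sketch, assuming the input is valid UTF-8 stored in a plain std::string. The function name count_code_points is made up for this example and is not part of any library. It shows that the length the standard library reports is a count of code units (bytes), not of code points, let alone of user-visible characters.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper for illustration only.
// Counts Unicode code points in a UTF-8 string by skipping continuation
// bytes, which always have the bit pattern 10xxxxxx.
// Assumes valid UTF-8; no validation is performed.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) {
            ++n;  // lead byte or ASCII byte: starts a new code point
        }
    }
    return n;
}
```

For the UTF-8 bytes of "café" (written with escapes as "caf\xC3\xA9"), size() reports 5 while the helper reports 4. Anything that indexes or classifies single bytes, such as isalpha(), sees fragments of the last character rather than the character itself.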

Unless you know you can get away with built-in stuff like wchar_t / std::wstring (and you wouldn't ask in that case), you are better off using the ICU library, which is the state-of-the-art implementation for Unicode support. (Even the otherwise-recommendable Boost.Locale relies on ICU to provide the actual logic.)

The C way of doing Unicode in ICU is arrays of type UChar [] (UTF-16); the C++ way is the class icu::UnicodeString. I happen to work with a legacy codebase that goes to great lengths to "make do" with UChar [] for claimed performance reasons (shared references, memory pooling, copy-on-write, etc.), but still fails to outperform icu::UnicodeString, so you can feel safe using the latter even in an embedded environment. They did a good job there.

Postscript: Take note that wchar_t is of implementation-defined width: 32 bits on the Unixes I know of, and 16 bits on Windows. The latter causes additional trouble, since wchar_t is supposed to be "wide", yet UTF-16 is still a variable-width ("multibyte") encoding when it comes to Unicode. If you can rely on the environment supporting C++11, char16_t and char32_t would be better choices, though they are still agnostic of finer points like combining characters.
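A rough sketch of these points, assuming a C++11 compiler. All names below are invented for the illustration. It shows that a code point outside the BMP takes two char16_t code units (a surrogate pair), and that even with char32_t a single visible character can span several code points.

```cpp
#include <climits>
#include <cstddef>
#include <string>

// Width of wchar_t on this platform; implementation-defined
// (commonly 32 on Unix-like systems, 16 on Windows).
constexpr std::size_t wchar_bits = sizeof(wchar_t) * CHAR_BIT;

// U+1F600 lies outside the BMP, so UTF-16 encodes it as a surrogate pair.
const std::u16string utf16_emoji = u"\U0001F600";
const std::u32string utf32_emoji = U"\U0001F600";

// 'e' followed by U+0301 COMBINING ACUTE ACCENT:
// one visible character, but two code points even in UTF-32.
const std::u32string combining_e = U"e\u0301";

// Hypothetical helper: counts code points in well-formed UTF-16 by not
// counting low (trailing) surrogates, which lie in 0xDC00..0xDFFF.
std::size_t count_code_points_utf16(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t c : s) {
        if (c < 0xDC00 || c > 0xDFFF) {
            ++n;
        }
    }
    return n;
}
```

Here utf16_emoji.size() is 2 while utf32_emoji.size() is 1, and combining_e.size() is 2 although it renders as a single "é". Fixed-width code units help with the first problem but not with the second; that part needs real Unicode logic such as ICU provides.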
