0 votes

I want strings with Unicode characters to be handled correctly in my file synchronizer application, but I don't know how this kind of encoding works.

In a Unicode string, I can see that a Unicode character has the form "\uxxxx", where the x's are hexadecimal digits. How does a normal C or C++ program interpret this kind of character? (Why is there a 'u' after the '\'? What is its effect?)

On the internet I see examples using "wide strings" or wchar_t. So what is the suitable object to handle Unicode characters? In RapidJSON (which supports Unicode: UTF-8, UTF-16, UTF-32), we can use a const char* to store JSON that could contain "wide characters", but those characters take more than one byte to represent... I don't understand.

This is the kind of temporary workaround I found for the moment (Unicode -> UTF-8? ASCII? listFolder is a std::string):

boost::replace_all(listFolder, "\\u00e0", "à");
boost::replace_all(listFolder, "\\u00e2", "â");
boost::replace_all(listFolder, "\\u00e4", "ä");
...
2 – Just "having Unicode characters" isn't a very precise definition. Do you plan to do anything with them? If you just need to store and forward strings, you can treat a Unicode string as an opaque byte string, with a length in bytes. – MSalters

2 Answers

5 votes

The suitable object to handle Unicode strings in C++ is icu::UnicodeString (check "API References, ICU4C" in the sidebar), at least if you want to really handle Unicode strings (as opposed to just passing them from one point of your application to another).

wchar_t was an early attempt at handling international character sets, which turned out to be a failure: Microsoft's definition of wchar_t as two bytes became insufficient once Unicode was extended beyond code point 0xFFFF. Linux defines wchar_t as four bytes, but the inconsistency makes it (and its derived std::wstring) rather useless for portable programming.
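A minimal illustration of that inconsistency (not from the original answer; the result is implementation-defined):

#include <iostream>

int main() {
    // Prints 2 under MSVC on Windows, 4 under GCC/Clang on Linux and macOS.
    std::cout << sizeof(wchar_t) << '\n';
}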

TCHAR is a Microsoft define that resolves to char by default and to WCHAR if UNICODE is defined, with WCHAR in turn being wchar_t behind a level of indirection... yeah.

C++11 brought us char16_t and char32_t as well as the corresponding string classes, but those are still instantiations of basic_string<>, and as such have their shortcomings, e.g. when trying to uppercase / lowercase characters that map to more than one replacement character (e.g. the German ß needs to be expanded to SS in uppercase; the standard library cannot do that).

ICU, on the other hand, goes the full way. For example, it provides normalization and decomposition, which the standard strings do not.
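A minimal sketch of those facilities, assuming ICU is installed (link against its common and i18n libraries, e.g. -licuuc -licui18n); the input string is made up for illustration:

#include <unicode/unistr.h>
#include <unicode/normalizer2.h>
#include <unicode/locid.h>
#include <iostream>
#include <string>

int main() {
    // "straße" as explicit UTF-8 bytes (0xC3 0x9F is ß).
    icu::UnicodeString s = icu::UnicodeString::fromUTF8("stra\xC3\x9F" "e");

    // Locale-aware uppercasing: ß expands to SS, which basic_string cannot do.
    s.toUpper(icu::Locale::getGerman());

    // NFC normalization, one of the facilities standard strings lack.
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);
    if (U_SUCCESS(status)) {
        icu::UnicodeString normalized = nfc->normalize(s, status);
        std::string utf8;
        normalized.toUTF8String(utf8);  // back to UTF-8 for storage
        std::cout << utf8 << '\n';      // prints STRASSE
    }
}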


\uxxxx and \UXXXXXXXX are Unicode character escapes. The xxxx is a 16-bit hexadecimal number representing a UCS-2 code point, which is equivalent to a UTF-16 code unit within the Basic Multilingual Plane.

The XXXXXXXX is a 32-bit hexadecimal number, representing a full Unicode code point, which may lie in any plane.

How those character escapes are handled depends on the context in which they appear (narrow / wide string, for example), making them somewhat less than perfect.

C++11 introduced "proper" Unicode literals:

u8"..." is always a const char[] in UTF-8 encoding.

u"..." is always a const uchar16_t[] in UTF-16 encoding.

U"..." is always a const uchar32_t[] in UTF-32 encoding.

If you use \uxxxx or \UXXXXXXXX within one of those three, the escape will always be expanded to the proper code unit sequence for that literal's encoding.
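A minimal sketch of that expansion, assuming C++11/C++17 semantics (in C++20, u8"..." yields const char8_t[] instead of const char[]):

#include <cstdio>

int main() {
    // \u00e0 is 'à'; array sizes count code units plus the terminating NUL.
    const char     utf8[]  = u8"\u00e0";  // 2 UTF-8 code units
    const char16_t utf16[] = u"\u00e0";   // 1 UTF-16 code unit
    const char32_t utf32[] = U"\u00e0";   // 1 UTF-32 code unit

    std::printf("%zu %zu %zu\n",
                sizeof utf8,                       // 3
                sizeof utf16 / sizeof(char16_t),   // 2
                sizeof utf32 / sizeof(char32_t));  // 2
}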


Note that storing UTF-8 in a std::string is possible, but hazardous. You need to be aware of many things: .length() is not the number of characters in your string. .substr() can lead to partial and invalid sequences. .find_first_of() will not work as expected. And so on.
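A minimal illustration of the first two pitfalls (not from the original answer):

#include <iostream>
#include <string>

int main() {
    std::string s = "\xC3\xA0";           // the two UTF-8 bytes of 'à'
    std::cout << s.length() << '\n';      // 2: bytes, not characters
    std::cout << s.substr(0, 1) << '\n';  // half a sequence: invalid UTF-8
}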

That being said, in my opinion UTF-8 is the only sane encoding choice for any stored text. There are cases to be made for handling text as UTF-16 in memory (the way ICU does), but on disk, don't accept anything but UTF-8. It's space-efficient, endianness-independent, and allows for semi-sane handling even by software that is blissfully unaware of Unicode matters (see caveats above).

2 votes

In a Unicode string, I can see that a Unicode character has the form "\uxxxx", where the x's are hexadecimal digits. How does a normal C or C++ program interpret this kind of character? (Why is there a 'u' after the '\'? What is its effect?)

That is a Unicode character escape sequence. It is interpreted as the Unicode character whose code point is given by the four hexadecimal digits. The u after the backslash is part of the syntax; it is what differentiates it from other escape sequences such as \n or \x. Read the documentation on escape sequences for more information.

So, what's the suitable object to handle unicode characters ?

  • char for UTF-8
  • char16_t for UTF-16
  • char32_t for UTF-32
  • The size of wchar_t is platform-dependent, so you cannot make portable assumptions about which encoding it suits.

we can use a const char* to store JSON that could contain "wide characters", but those characters take more than one byte to represent...

If you mean that you can store multi-byte UTF-8 characters in a char string, then you're correct.

This is the kind of temporary workaround I found for the moment (Unicode -> UTF-8? ASCII? listFolder is a std::string)

What you're attempting to do there is replace some Unicode escape sequences with characters in a platform-defined encoding. If the string contains other Unicode characters besides those, you end up with a string of mixed encodings. In some cases the replacement may also accidentally match bytes inside other sequences. I recommend using a library to convert encodings or to do any other manipulation of encoded strings.
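For example, since the question mentions RapidJSON: parsing the JSON lets the library decode the \uXXXX escapes into UTF-8 for you, making the hand-written replacements unnecessary. A minimal sketch (the JSON text and the "name" key are made up for illustration):

#include "rapidjson/document.h"
#include <iostream>

int main() {
    // RapidJSON's default UTF-8 mode decodes \u00e0 to the bytes 0xC3 0xA0.
    const char* json = "{\"name\":\"voil\\u00e0\"}";

    rapidjson::Document doc;
    doc.Parse(json);
    if (doc.HasParseError()) return 1;

    std::cout << doc["name"].GetString() << '\n';  // prints voilà as UTF-8
}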