If I understand correctly, it is possible to use both string and wstring to store UTF-8 text.
With char, ASCII characters take a single byte, some Chinese characters take 3 or 4, etc. This means that str[3] doesn't necessarily point to the 4th character.
With wchar_t it is the same thing, but the minimum number of bytes used per character is always 2 (instead of 1 for char), and a 3- or 4-byte-wide character will take 2 wchar_t.
Right?
So, what if I want to use string::find_first_of() or string::compare(), etc. with such a weirdly encoded string? Will it work? Does the string class handle the fact that characters have a variable size? Or should I only use them as dummy feature-less byte arrays, in which case I'd rather go for a wchar_t[] buffer.
If std::string doesn't handle that, second question: are there libraries providing string classes that could handle that UTF-8 encoding so that str[3] actually points to the 4th character (which would be a byte array of length 1 to 4)?
Even if str[3] was the fourth code point, that's not necessarily the fourth user-perceived character. – user395760
wchar_t is implementation-defined, so not always 2 bytes. Moreover (IIRC) Windows uses it to store something like UTF-16, not UTF-8. See en.wikipedia.org/wiki/Wide_character – gx_