1
votes

Recently I have come across conversions from UTF-8 encoded data to strings and vice versa. My understanding is that UTF-8 can represent almost all the characters in the world, whereas char, the built-in data type used for strings, can only store ASCII values. A character in UTF-8 takes between one and four bytes in memory, but a 'char' is usually 1 byte.

My question is: what happens in a conversion from wstring to string, or from wchar to char? Are the characters that require more than one byte skipped? It seems to depend on the implementation, but I want to know the correct way of doing it.

Also, is wchar required to store Unicode characters? As far as I understand, Unicode characters can be stored in a normal string as well. Why should we use wstring or wchar?

2
char is not an encoding, but a data type. And there is no conversion defined, only a plethora of conversion functions, and you have to pick the appropriate one. - Deduplicator
@Deduplicator: Thanks for correcting the mistake. Do we need the wchar/wstring type for UTF-8 encoding? I understood that we can use a normal string or char. - evk1206

2 Answers

3
votes

It depends on how you convert them.
You need to specify the source encoding and the target encoding.
wstring is not a format; it just defines a data type.
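
To make the "data type, not format" point concrete, here is a minimal sketch that reports how wide one wchar_t is on the platform you compile for. The helper name wchar_bits is made up for illustration. The result is typically 16 with MSVC on Windows and 32 with GCC or Clang on Linux, so the same std::wstring type tends to hold UTF-16 on one platform and UTF-32 on the other.

```cpp
#include <climits>
#include <cstddef>

// Hypothetical helper: number of bits in one wchar_t on this platform.
// The C++ standard does not fix this value; it is implementation-defined.
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}
```

Printing wchar_bits() on each target is a quick way to check which assumption your code is relying on.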

Now, when one says "Unicode" in this context, one usually means UTF-16, which is what Microsoft Windows uses, and that is usually what a wstring contains there.

So, the right way to convert from UTF-8 to UTF-16:

     std::string utf8String = "blah blah";

     std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
     std::wstring utf16String = convert.from_bytes( utf8String );

And the other way around:

     std::wstring utf16String = L"blah blah";

     std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
     std::string utf8String = convert.to_bytes( utf16String );
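
As for whether multi-byte characters get skipped: with a proper conversion they are not; they are re-encoded. The following is a rough, self-contained sketch of the round trip, not a definitive implementation. The function names utf8_to_utf16 and utf16_to_utf8 are my own, and I have swapped wchar_t for char16_t / std::u16string so the result is UTF-16 regardless of how wide wchar_t is on the platform. Note that std::wstring_convert and the <codecvt> header are deprecated since C++17, so some compilers will warn about them.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Assumption: char16_t is used instead of wchar_t to keep the example portable.
using Utf8Utf16 =
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t>;

// UTF-8 bytes -> UTF-16 code units.
// Throws std::range_error if the input is not valid UTF-8.
std::u16string utf8_to_utf16(const std::string& utf8) {
    Utf8Utf16 convert;
    return convert.from_bytes(utf8);
}

// UTF-16 code units -> UTF-8 bytes.
std::string utf16_to_utf8(const std::u16string& utf16) {
    Utf8Utf16 convert;
    return convert.to_bytes(utf16);
}
```

For example, the euro sign is three bytes in UTF-8 ("\xE2\x82\xAC") and becomes a single UTF-16 code unit, while a character outside the Basic Multilingual Plane such as U+1F600 is four bytes in UTF-8 and becomes a surrogate pair (two code units). Converting back yields the original bytes.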

And to add to the confusion:
When you use std::string on a Windows platform (for example, in a multibyte build), it is NOT UTF-8. Windows uses "ANSI" there.
More specifically, it is the default ANSI code page, which depends on the language your Windows installation is set to.

The Windows API functions come in two variants, each expecting a different format (compiling in Unicode mode makes the unsuffixed name map to the W variant):

CommandA - multibyte - ANSI
CommandW - Unicode - UTF-16

1
votes

Make your source files UTF-8 encoded and set the character set to Unicode in your IDE.
Use std::string and widen it for Windows API calls:

     std::string somestring = "こんにちは";
     WindowsApiW(widen(somestring).c_str());

I know it sounds kind of hacky, but a more thorough explanation of this issue can be found at utf8everywhere.org.
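
Note that widen is not a standard function. One possible way to write it, assuming the input is valid UTF-8, is sketched below; this is only an illustration and not necessarily how utf8everywhere.org implements it. It relies on std::wstring_convert, which is deprecated since C++17.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper matching the name used above:
// converts UTF-8 bytes to a wide string of UTF-16 code units.
// Throws std::range_error if the input is not valid UTF-8.
std::wstring widen(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
    return convert.from_bytes(utf8);
}
```

With this, the five characters of "こんにちは" (15 bytes in UTF-8) come out as five wide characters, which is what a W function expects to receive via c_str().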