From MSDN, "Support for Unicode":
A wide character is a 2-byte multilingual character code. Tens of thousands of characters, comprising almost all characters used in modern computing worldwide, including technical symbols and special publishing characters, can be represented according to the Unicode specification as a single wide character encoded by using UTF-16. Characters that cannot be represented in just one wide character can be represented in a Unicode pair by using the Unicode surrogate pair feature. Because almost every character in common use is represented in UTF-16 in a single 16-bit wide character, using wide characters simplifies programming with international character sets. Wide characters encoded using UTF-16LE (for little-endian) are the native character format for Windows.
But consider this program, compiled with /utf-8:
#include <bitset>
#include <cwchar>
#include <iostream>

int wmain(int argc, wchar_t* argv[])
{
    const wchar_t* wstr = L"ä ∫";
    for (size_t i = 0; i < wcslen(wstr); i++)
        std::cout << std::hex << wstr[i] << " | ";
    std::cout << std::endl;
    for (size_t i = 0; i < wcslen(wstr); i++)
        std::cout << std::bitset<8>(wstr[i] >> 8) << " "
                  << std::bitset<8>(wstr[i]) << " | ";
    return 0;
}
Output:
e4 | 20 | 222b |
00000000 11100100 | 00000000 00100000 | 00100010 00101011 |
Here ä comes out as 00000000 11100100, which looks like UTF-16BE.
And ∫ comes out as 00100010 00101011, which also looks like UTF-16BE.
Where am I wrong? What did I miss?
Comments:

You're assuming that cout's output reflects the internal representation. It doesn't. If you want to output bytes in the order they are stored in memory, cast to char const* and output the individual bytes. – IInspectable

Dump the bytes at wstr's address and you will see UTF-16LE: e4 00 20 00 2b 22. – Sonny D

wstr[i] >> 8 operates on the 16-bit value (0x00E4 for ä), regardless of the endianness of the bytes in memory, and std::bitset is defined in terms of bitwise operations. – Eryk Sun