How are Unicode strings (wide strings) encoded in Windows? UTF-16LE or UTF-16BE?

Question

A wide character is a 2-byte multilingual character code. Tens of thousands of characters, comprising almost all characters used in modern computing worldwide, including technical symbols and special publishing characters, can be represented according to the Unicode specification as a single wide character encoded by using UTF-16. Characters that cannot be represented in just one wide character can be represented in a Unicode pair by using the Unicode surrogate pair feature. Because almost every character in common use is represented in UTF-16 in a single 16-bit wide character, using wide characters simplifies programming with international character sets. Wide characters encoded using UTF-16LE (for little-endian) are the native character format for Windows.

However, the following code, compiled with /utf-8:

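#include <bitset>
#include <cwchar>
#include <iostream>

int wmain(int argc, wchar_t * argv[])
{
    const wchar_t * wstr = L"ä ∫";
    // Print the numeric value of each 16-bit code unit in hex.
    for(size_t i=0; i < wcslen(wstr); i++) std::cout << std::hex << static_cast<unsigned>(wstr[i]) << " | ";
    std::cout << std::endl;
    // Print the high and low 8 bits of each code unit's value (not its bytes in memory).
    for(size_t i=0; i < wcslen(wstr); i++) std::cout << std::bitset<8>(wstr[i] >> 8) << " " << std::bitset<8>(wstr[i]) << " | ";
    return 0;
}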

Output:

e4 | 20 | 222b |
00000000 11100100 | 00000000 00100000 | 00100010 00101011 |

Here ä is shown as 00000000 11100100, which looks like UTF-16BE.
And ∫ is shown as 00100010 00101011, which also looks like UTF-16BE.

Where I am wrong? What I missed?

"Where I am wrong?" - Precisely where you assume that cout's output would reflect the internal representation. It doesn't. If you want to output bytes in the order they are stored in memory, cast to char const* and output the individual bytes. — IInspectable
@IInspectable Okay, you are right. I inspected the memory at the address of wstr and it holds UTF-16LE: e4 00 20 00 2b 22. — Sonny D
@IInspectable So, given that, which encoding is it that we see in the output of the code above? — Sonny D
Bit shifts are independent of endianness. wstr[i] >> 8 operates on the 16-bit value 0b0000000011100100 (0x00E4), regardless of the order of the bytes in memory, and std::bitset is defined in terms of bitwise operations. — Eryk Sun
@IInspectable can you write your comment as answer? I will accept that. — Sonny D

Sonny D · Accepted Answer · 2021-03-09T08:48:45

This answer is a copy of @IInspectable's first comment on the question:

"Where I am wrong?" - Precisely where you assume that cout's output would reflect the internal representation. It doesn't. If you want to output bytes in the order they are stored in memory, cast to char const* and output the individual bytes.

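A minimal sketch of the suggestion above, for illustration only. The helper name dump_bytes is hypothetical and not part of the original post. It reads the wide string through an unsigned char pointer (rather than char const*, to avoid sign extension when printing) and formats each byte in the order it is stored in memory. It returns a string instead of writing to std::cout so that the result is easy to inspect.

```cpp
#include <cstddef>
#include <cwchar>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper (not from the original post): returns the bytes of a
// wide string in memory order, as space-separated two-digit hex values.
std::string dump_bytes(const wchar_t* wstr)
{
    // Number of bytes occupied by the characters, excluding the terminator.
    const std::size_t byte_count = std::wcslen(wstr) * sizeof(wchar_t);

    // Viewing any object through unsigned char* is permitted, and unlike
    // shifts on the wchar_t value, it exposes the actual storage order.
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(wstr);

    std::ostringstream out;
    out << std::hex << std::setfill('0');
    for (std::size_t i = 0; i < byte_count; ++i) {
        if (i != 0) out << ' ';
        // Cast to unsigned so the byte is printed as a number, not a character.
        out << std::setw(2) << static_cast<unsigned>(bytes[i]);
    }
    return out.str();
}
```

On Windows, where wchar_t is 16 bits wide and stored little-endian, calling dump_bytes(L"\u00e4 \u222b") should give e4 00 20 00 2b 22, which matches the memory contents reported in the comments. On platforms where wchar_t is 32 bits wide, each character occupies four bytes instead, so the output differs.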