0 votes

I would like confirmation regarding my understanding of raw string literals and the (non-wide) execution character set on Windows.

Relevant paragraphs for which I desire specific confirmation are in BOLD. But first, some background.


BACKGROUND

(relevant questions are in the paragraphs below in bold)

As a result of the helpful discussion beneath @TheUndeadFish's answer to this question that I posted yesterday, I have attempted to understand the rules determining the character set and encoding used as the execution character set in MSVC on Windows (in the C++ specification sense of execution character set; see @DietmarKühl's posting).

I suspect that some might consider it a waste of time to even bother trying to understand the ANSI-related behavior of char * (i.e., non-wide) strings for non-ASCII characters in MSVC.

For example, consider @IInspectable's comment here:

You cannot throw a UTF-8 encoded string at the ANSI version of a Windows API and hope for anything sane to happen.

Please note that in my current i18n project on a Windows MFC-based application, I will be removing all calls to the non-wide (i.e., ANSI) versions of API calls, and I expect the compiler to generate execution wide-character set strings, NOT execution character set (non-wide) strings internally.

However, I want to understand the existing code, which already has some internationalization that uses the ANSI API functions. Even if some consider the behavior of the ANSI API on non-ASCII strings to be insane, I want to understand it.

I think that, like others, I have found it difficult to locate clear documentation about the non-wide execution character set on Windows.

In particular, because the (non-wide) execution character set is defined by the C++ standard to be a sequence of char (as opposed to wchar_t), UTF-16 cannot be used internally to store characters in the non-wide execution character set. In this day and age, it makes sense that the Unicode character set, encoded via UTF-8 (a char-based encoding), would therefore be used as the character set and encoding of the execution character set. To my understanding, this is the case on Linux. However, sadly, this is not the case on Windows - not even in MSVC 2013.

This leads to the first of my two questions.


Question #1: Please confirm that I'm correct in the following paragraph.

With this background, here's my question. In MSVC, including VS 2013, it seems that the execution character set is one of the (many possible) ANSI character sets, using one of the (many possible) code pages corresponding to that particular ANSI character set to define the encoding - rather than the Unicode character set with UTF-8 encoding. (Note that I am asking about the NON-WIDE execution character set.) Is this correct?
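To make this concrete, here is the kind of minimal check I have in mind (a sketch only; the expected bytes in the comments reflect my assumption of a Western European machine using CP1252, and how the compiler reads the source file is itself part of what I am asking about):

    #include <cstdio>

    // Dump the raw bytes the compiler 'burned into' the executable for a
    // narrow (non-wide) string literal containing a non-ASCII character.
    int main()
    {
        const char* s = "ä"; // narrow literal; its bytes are fixed at compile time

        for (const char* p = s; *p != '\0'; ++p)
            std::printf("%02X ", static_cast<unsigned>(static_cast<unsigned char>(*p)));
        std::printf("\n");

        // My expectation (to be confirmed):
        //   MSVC with a Western European ANSI code page (CP1252): E4
        //   a toolchain whose execution charset is UTF-8:         C3 A4
    }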


BACKGROUND, CONTINUED (assuming I'm correct in Question #1)

If I understand things correctly, then the above bolded paragraph is arguably a large part of the cause of the "insanity" of using the ANSI API on Windows.

Specifically, consider the "sane" case - in which Unicode and UTF-8 are used as the execution character set.

In this case, it does not matter what machine the code is compiled on, or when, and it does not matter what machine the code runs on, or when. The raw bytes of a string literal will always be the UTF-8 encoding of its Unicode characters, and the runtime system will always treat such strings, semantically, as UTF-8.
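For illustration, a small probe along these lines could distinguish the two cases (a sketch I have not run; it assumes a compiler that supports C++11 u8 string literals, which my VS 2013 may not, a pre-C++20 dialect where u8 literals are still arrays of char, and a source file saved with a BOM so the compiler knows its encoding):

    #include <cstdio>
    #include <cstring>

    int main()
    {
        // u8"..." is guaranteed to be UTF-8 regardless of the execution
        // character set, so comparing it with the plain narrow literal tells
        // us whether the execution character set happens to be UTF-8.
        const char narrow[] = "ä";   // encoded in the execution character set
        const char utf8[]   = u8"ä"; // always UTF-8: C3 A4

        if (sizeof(narrow) == sizeof(utf8)
            && std::memcmp(narrow, utf8, sizeof(narrow)) == 0)
            std::printf("execution charset appears to be UTF-8\n");
        else
            std::printf("execution charset is NOT UTF-8 (e.g. an ANSI code page)\n");
    }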

No such luck in the "insane" case (if I understand correctly), in which ANSI character sets and code page encodings are used as the execution character set. In this case (the Windows world), the runtime behavior may be affected by the machine that the code is compiled on, in comparison with the machine the code runs on.


Here, then, is Question #2: Again, please confirm that I'm correct in the following paragraph.

With this continued background in mind, here is what I suspect: with MSVC, the execution character set and its encoding depend in some not-so-easy-to-understand way on the locale selected by the compiler on the machine the compiler is running on, at the time of compilation. This determines the raw bytes for string and character literals that are 'burned into' the executable. And, at run time, the MSVC C runtime library may be using a different execution character set and encoding to interpret those raw bytes. Am I correct?
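For reference, this is the kind of run-time check I have in mind; GetACP and setlocale are the real APIs, and the rest is just a sketch of how I would inspect the 'run-time side' of the suspected mismatch:

    #include <clocale>
    #include <cstdio>
    #include <windows.h>

    int main()
    {
        // The ANSI code page that the OS (and the "A" API functions) will use
        // at run time on this machine.
        std::printf("GetACP() = %u\n", GetACP());

        // The locale the C runtime switches to when we opt in to the user's
        // default locale (the CRT starts out in the "C" locale by default).
        std::printf("CRT locale = %s\n", std::setlocale(LC_ALL, ""));

        // By contrast, the bytes of this literal were fixed at COMPILE time,
        // on the machine that built the executable:
        const char* literal = "ä";
        std::printf("first byte of literal = %02X\n",
                    static_cast<unsigned>(static_cast<unsigned char>(literal[0])));
    }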

(I may add further examples to this question at some point.)


FINAL COMMENTS

Fundamentally, if I understand correctly, the above bolded paragraph explains the "insanity" of using the ANSI API on Windows. Due to the possible difference between the ANSI character set and encoding chosen by the compiler and the ANSI character set and encoding chosen by the C runtime, non-ASCII characters in string literals may not appear as expected in a running MSVC program when the ANSI API is used in the program.
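As an illustration of the kind of mismatch I mean (a sketch; code pages 1252 and 1251 are simply the two examples I picked): the single byte 0xE4 is 'ä' in CP1252 but 'д' in CP1251, so the same compiled-in byte is interpreted as a different character depending on which ANSI code page is in effect when it is converted or displayed.

    #include <cstdio>
    #include <windows.h>

    int main()
    {
        // One byte, as a compiler might have burned it into the executable
        // when it believed the execution charset was CP1252 ('ä' == 0xE4).
        const char ansiByte[] = "\xE4";

        wchar_t wide1252[4] = {};
        wchar_t wide1251[4] = {};

        // Interpret that same byte under two different ANSI code pages.
        MultiByteToWideChar(1252, 0, ansiByte, -1, wide1252, 4); // -> U+00E4 'ä'
        MultiByteToWideChar(1251, 0, ansiByte, -1, wide1251, 4); // -> U+0434 'д'

        std::printf("CP1252: U+%04X  CP1251: U+%04X\n",
                    static_cast<unsigned>(wide1252[0]),
                    static_cast<unsigned>(wide1251[0]));
    }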

(Note that the ANSI "insanity" really only applies to string literals, because according to the C++ standard the actual source code must be written in a subset of ASCII (and source code comments are discarded by the compiler).)

The description above is the best current understanding I have of the ANSI API on Windows in regards to string literals. I would like confirmation that my explanation is well-formed and that my understanding is correct.

TL;DR... I have a feeling this belongs on Programmers. – Captain Obvlious
You might be interested in reading utf8everywhere.org – Mark Tolonen
@MarkTolonen - Thanks! I've read it about 100 times, and in fact I'll be using the approach of utf8everywhere.org to implement my internationalization. Note that utf8everywhere.org recommends using the wide versions of API calls, which I'll be doing (converting to/from UTF-8 only at the point of use). However, the existing application I'm working with uses some narrow APIs for some (weak) i18n, and I would (nonetheless) like to understand the non-wide issues. – Dan Nissenbaum
Note 1: Windows generally has two non-Unicode character sets, one for windowed apps and a different one for console apps in a command window. Note 2: Windows never uses UTF-8 on its own; you need to go out of your way to configure it, and the support is reportedly buggy. – Mark Ransom
@Mark: Considering that the entire metadata stored in .NET assemblies is encoded using UTF-8, I'd be somewhat surprised to hear that Windows' UTF-8 support is buggy. – IInspectable

1 Answer

0 votes

This is a very long story, and I have trouble finding a single clear question in it. However, I think I can resolve a number of misunderstandings that led to it.

First off, "ANSI" is a synonym for the (narrow) execution character set; UTF-16 is the execution wide-character set.

The compiler will NOT choose for you. If you use narrow char strings, they are ANSI as far as the compiler (runtime) is aware.

Yes, the particular "ANSI" character encoding can matter. If you compile an L"ä" literal on your PC and your source code is saved in CP1252, then that ä character is compiled to the UTF-16 ä. However, the same source byte could be a different non-ASCII character in other encodings, which would result in a different UTF-16 character.

Note, however, that MSVC is perfectly capable of compiling both UTF-8 and UTF-16 source code, as long as the source file starts with a U+FEFF BOM. This makes the whole theoretical problem pretty much a non-issue.

[edit] "Specifically, with MSVC, the execution character set and its encoding depends..."

No, MSVC has nothing to do with the execution character set, really. The meaning of char(0xE4) is determined by the OS. To see this, compare with the MinGW compiler: executables produced by MinGW behave the same as those produced by MSVC, since both target the same OS.
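To see this in practice, compile something like the following with both MSVC and MinGW (a sketch; the byte 0xE4 and the 'ä'/'д' glyphs are just an example). Which glyph the message box shows depends on the ANSI code page of the system the program runs on, not on which compiler produced the binary:

    #include <windows.h>

    int main()
    {
        // The "A" version of the API interprets this byte according to the
        // system ANSI code page at RUN time: 0xE4 shows as 'ä' on a CP1252
        // system and as 'д' on a CP1251 (Cyrillic) system, regardless of
        // whether MSVC or MinGW compiled the executable.
        MessageBoxA(NULL, "\xE4", "ANSI demo", MB_OK);
        return 0;
    }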