Inside the application, you have two basic models:
- Use UTF-16 throughout your application.
- Use UTF-8 strings throughout, and convert to/from UTF-16 at Win32 API / MFC / ... calls.
The first might be an issue if you are going to heavily use libraries that don't support UTF-16. I've never found that to be an issue in practice. Some people will tell you that you are stupid and your product is doomed based solely on the fact that you are using UTF-16, but I've never found that to be an issue in practice either.
If you give in to peer pressure, or depend on existing UTF-8-centric code, using UTF-8 internally can be simplified by a custom wrapper class for your strings that converts to/from CString, plus some helper classes to deal with [out] CString* / CString& parameters. For non-MFC, non-CString code, std::vector&lt;TCHAR&gt; would be a good representation. That wrapper should of course not convert implicitly to/from char* or wchar_t*.
The files you read and write:
As long as they are "your" application files, you can do whatever you want. In fact, using an opaque (binary) format might isolate you completely from user issues. Just be consistent.
The problems arise when you start to process files from other applications, or when users can be expected to edit your application's text files with other tools. This is where it gets bleak: UTF-8 support has been very limited for many years, so many tools can't cope with it well. Others do recognize and interpret UTF-8 correctly, but fail to skip a BOM marker if one is present.
Still, UTF-8 is the "safe bet for the future". Even if it means more upfront development, I'd strongly recommend using it for shared files.
Our solution, after some back and forth, is the following:
Reading Text Files, the default algorithm is:
- probe for BOM. If any is present, rely on BOM (but of course skip it)
- probe for valid UTF-16 (we even support LE/BE, though BE is unlikely to appear).
- probe for ASCII only (all bytes <= 127). If so, interpret as ASCII
- probe for UTF-8. If the body would be valid UTF-8, read as UTF-8
- otherwise fall back to current code page
UTF-8 was specifically designed so that the chance of text in any other encoding also being valid UTF-8 is very, very low. This makes the order of the last two steps fairly safe.
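A minimal sketch of that detection order, assuming the whole file (or a large enough prefix) is already in memory. The BOM-less UTF-16 probe is simplified here to the Win32 IsTextUnicode heuristic, and the UTF-8 probe relies on MultiByteToWideChar with MB_ERR_INVALID_CHARS, which fails on any byte sequence that is not well-formed UTF-8:

```cpp
#include <windows.h>
#include <cstddef>
#include <cstring>

enum class TextEncoding { Utf8, Utf16LE, Utf16BE, Ascii, CurrentCodePage };

TextEncoding DetectEncoding(const unsigned char* data, std::size_t size)
{
    // 1. Probe for a BOM; if present, rely on it (the caller skips it).
    if (size >= 3 && std::memcmp(data, "\xEF\xBB\xBF", 3) == 0)
        return TextEncoding::Utf8;
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return TextEncoding::Utf16LE;
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return TextEncoding::Utf16BE;

    // 2. Probe for BOM-less UTF-16 (simplified: statistical Win32 heuristic).
    int tests = IS_TEXT_UNICODE_STATISTICS;
    if (size >= 2 && ::IsTextUnicode(data, static_cast<int>(size), &tests))
        return TextEncoding::Utf16LE;

    // 3. Probe for pure ASCII (all bytes <= 127).
    bool allAscii = true;
    for (std::size_t i = 0; i < size; ++i)
        if (data[i] > 127) { allAscii = false; break; }
    if (allAscii)
        return TextEncoding::Ascii;

    // 4. Probe for valid UTF-8: the conversion fails on malformed sequences.
    int wideLen = ::MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                        reinterpret_cast<const char*>(data),
                                        static_cast<int>(size), nullptr, 0);
    if (wideLen > 0)
        return TextEncoding::Utf8;

    // 5. Otherwise fall back to the current ANSI code page.
    return TextEncoding::CurrentCodePage;
}
```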
Writing text files, we use UTF-8 without a BOM. From a short, informal survey of the external tools we use, this is the safest bet.
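For completeness, a small sketch of the writing side, assuming the in-memory text is the UTF-16 CStringW used at the Win32 boundary (WriteUtf8NoBom is a hypothetical helper name): convert to UTF-8 and write the raw bytes, deliberately without prepending the EF BB BF BOM.

```cpp
#include <windows.h>
#include <atlstr.h>
#include <filesystem>
#include <fstream>
#include <string>

bool WriteUtf8NoBom(const std::filesystem::path& path, const CStringW& text)
{
    // Convert UTF-16 to UTF-8.
    int len = ::WideCharToMultiByte(CP_UTF8, 0, text, text.GetLength(),
                                    nullptr, 0, nullptr, nullptr);
    std::string utf8(len > 0 ? len : 0, '\0');
    if (len > 0)
        ::WideCharToMultiByte(CP_UTF8, 0, text, text.GetLength(),
                              &utf8[0], len, nullptr, nullptr);

    // Binary mode: no newline translation, and no BOM is written.
    std::ofstream out(path, std::ios::binary);
    out.write(utf8.data(), static_cast<std::streamsize>(utf8.size()));
    return out.good();
}
```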
Based on that, we have also included a utility that lets our devs and users detect non-UTF-8 text files and convert them to UTF-8.