3
votes

We are converting Windows code from legacy character sets to Unicode. Our GUI code uses MFC but we also have a lot of non-GUI modules that will be incorporated into a non-MFC environment.

Is UTF-8 the most futureproof way to save data files?

Windows system calls have to use wide character strings, otherwise they will be interpreted in a legacy code page. Is it better to use wide character strings (compatible with system calls and MFC) or UTF-8 (compatible with data files if we go that way) for general strings within the program?

How can we minimise the risk of UTF-8 strings being interpreted as being in legacy code pages? We have had cross-code page problems with overseas users in the past and getting away from this is one of our motives for moving to full Unicode.

5
You don't have any choice in the matter regarding the internal encoding used by your program, at least for the parts that call Win32 APIs: you have to use UTF-16. As for the external data files, that's entirely your choice. UTF-8 is often a good choice. But it all depends on your needs. – David Heffernan
I would like to say the same as @DavidHeffernan here, except I would like to be the first to say it, so that he would appear to repeat my statement instead of the opposite. Since it's objectively the best approach. Thank you. – Cheers and hth. - Alf
@Cheersandhth.-Alf You made me chuckle there! – David Heffernan
@jalf: well it's not entirely separate, it's just well separated. In particular, there's an impact on efficiency when passing strings into an API function or a library function that's adapted to Windows, and there's an impact on complexity when handling strings produced by such API or general library functions. For the purpose of porting original Unix-land code this impact may count for less than the impact of wholesale code modification. But when the task is to convert up legacy code, one is free to choose a way that is both efficient and of low complexity (more bug-free, less work). – Cheers and hth. - Alf

5 Answers

2
votes

Unfortunately the situation in Windows is kind of ugly. Despite standardizing on Unicode internally, text files are still interpreted using the current code page in many cases.

UTF-8 is a good choice for files because it allows interchanging data between Windows systems that use different languages, plus Linux and its relatives. You can increase the chances of a UTF-8 file being interpreted correctly by putting a Byte order mark (BOM) at the start of the file. It's not a perfect solution; not all programs will recognize it, and it goes against the Unicode Standard recommendations.
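For illustration, here is a minimal sketch of emitting that BOM when writing a file, assuming the body is already UTF-8-encoded; the file name is just an example:

#include <fstream>

int main()
{
    std::ofstream out("data.txt", std::ios::binary);   // example file name
    const unsigned char bom[] = { 0xEF, 0xBB, 0xBF };  // the UTF-8 BOM bytes
    out.write(reinterpret_cast<const char*>(bom), sizeof bom);
    out << "body text, assumed to be UTF-8-encoded\n";
}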

The Windows API uses UTF-16 for its Unicode interface. I'd stick with that for internal program usage unless you enjoy swimming against the tide.

2
votes

Inside the application, you have two basic models:

  • Use UTF-16 throughout your application.
  • Use UTF-8 strings throughout, and convert from/to UTF-16 at Win32 API / MFC / ... calls.

The first might be an issue if you are going to heavily use libraries that don't support UTF-16. I've never found that to be an issue in practice. Some people will tell you that you are stupid and your product is doomed based solely on the fact that you are using UTF-16, but I've never found that to be an issue in practice either.

If you give in to peer pressure, or depend on existing UTF-8-centric code, using UTF-8 internally is simplest with a custom wrapper class for your strings that converts to / from CString, plus some helper classes to deal with [out] CString * / CString & parameters. For non-MFC, non-CString code, std::vector<TCHAR> would be a good representation. That wrapper should of course not convert implicitly to/from char * or wchar_t *.
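As an illustration, here is a minimal sketch of the two conversions such a wrapper would sit on top of, assuming Windows and a Unicode build; widen and narrow are hypothetical names, and error handling is omitted:

#include <string>
#include <windows.h>

// UTF-8 -> UTF-16, for passing strings into Win32 / MFC.
std::wstring widen(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
    int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                utf8.data(), (int)utf8.size(), nullptr, 0);
    std::wstring wide(n, L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), (int)utf8.size(), &wide[0], n);
    return wide;
}

// UTF-16 -> UTF-8, for strings coming back out.
std::string narrow(const std::wstring& wide)
{
    if (wide.empty()) return std::string();
    int n = WideCharToMultiByte(CP_UTF8, 0, wide.data(), (int)wide.size(),
                                nullptr, 0, nullptr, nullptr);
    std::string utf8(n, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.data(), (int)wide.size(),
                        &utf8[0], n, nullptr, nullptr);
    return utf8;
}

In a Unicode build, the wrapper's CString conversions then reduce to CString(widen(s).c_str()) in one direction and narrow(cs.GetString()) in the other.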


The files you read and write:

As long as they are "your" application files, you can do whatever you want. In fact, using an opaque (binary) format might isolate you completely from user issues. Just be consistent.

The problems arise when you start to process files from other applications, or users can be expected to edit your application's text files with other applications. This is where it starts to become bleak. Since UTF-8 support has been very limited for many years, many tools can't cope well with that. Other programs do recognize and interpret UTF-8 correctly, but fail to skip any BOM marker present.

Still, UTF-8 is the "safe bet for the future". Even if it means more upfront development, I'd strongly recommend using it for shared files.


Our solution, after some back and forth, is the following:

Reading text files, the default algorithm is:

  • probe for BOM. If any is present, rely on BOM (but of course skip it)
  • probe for valid UTF-16 (we even support LE/BE, though BE is unlikely to appear).
  • probe for ASCII only (all bytes <= 127). If so, interpret as ASCII
  • probe for UTF-8. If the body would be valid UTF-8, read as UTF-8
  • otherwise fall back to current code page

UTF-8 was specifically designed so that the chance of text in any other encoding also being valid UTF-8 is very, very low. This makes the order of the last two steps fairly safe.
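A sketch of that probing order, assuming the whole file is already in memory; the names (Encoding, detect, is_valid_utf8) are invented for this example, the UTF-8 validator is simplified (it checks lead/continuation byte structure but not every overlong or surrogate case), and the UTF-16 probe is a naive stand-in for a real one:

#include <cstddef>
#include <vector>

enum class Encoding { Utf8, Utf16LE, Utf16BE, Ascii, CurrentCodePage };

static bool is_valid_utf8(const unsigned char* p, std::size_t n)
{
    for (std::size_t i = 0; i < n; ) {
        unsigned char c = p[i];
        std::size_t len;
        if      (c < 0x80)           { ++i; continue; }   // plain ASCII byte
        else if ((c & 0xE0) == 0xC0) len = 2;
        else if ((c & 0xF0) == 0xE0) len = 3;
        else if ((c & 0xF8) == 0xF0) len = 4;
        else return false;                                // stray continuation / invalid lead
        if (i + len > n) return false;                    // truncated sequence
        for (std::size_t k = 1; k < len; ++k)
            if ((p[i + k] & 0xC0) != 0x80) return false;  // must be 10xxxxxx
        i += len;
    }
    return true;
}

Encoding detect(const std::vector<unsigned char>& b)
{
    // 1. A BOM, if present, decides.
    if (b.size() >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return Encoding::Utf8;
    if (b.size() >= 2 && b[0] == 0xFF && b[1] == 0xFE) return Encoding::Utf16LE;
    if (b.size() >= 2 && b[0] == 0xFE && b[1] == 0xFF) return Encoding::Utf16BE;

    // 2. Naive UTF-16 probe: Latin-heavy UTF-16 text has NULs in every other byte.
    std::size_t evenNul = 0, oddNul = 0;
    for (std::size_t i = 0; i < b.size(); ++i)
        (i % 2 == 0 ? evenNul : oddNul) += (b[i] == 0);
    if (b.size() >= 4 && oddNul  > b.size() / 4) return Encoding::Utf16LE;
    if (b.size() >= 4 && evenNul > b.size() / 4) return Encoding::Utf16BE;

    // 3. All bytes <= 127? Plain ASCII.
    bool ascii = true;
    for (unsigned char c : b) if (c > 127) { ascii = false; break; }
    if (ascii) return Encoding::Ascii;

    // 4. A valid UTF-8 body is almost certainly UTF-8.
    if (is_valid_utf8(b.data(), b.size())) return Encoding::Utf8;

    // 5. Otherwise fall back to the current code page.
    return Encoding::CurrentCodePage;
}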

Writing text files, we use UTF-8 without BOM. From a short, informal survey of the external tools we use, this is the safest bet.

Based on that, we have also included a utility that lets our devs and users detect and convert non-UTF-8 text files to UTF-8.

0
votes

I would agree with @DavidHeffernan for the APIs. I would also recommend switching to Unicode completely (we took a deep breath and did that for all our applications; it's a one-time effort that pays off in the long term).

0
votes

As Mark Ransom has already answered, and as David Heffernan and I have already commented, UTF-16 is the practical choice for a Windows program's internals, while UTF-8 is a very good choice for external representation (except for interactive console I/O, which however isn't much of an issue).

Since you are converting up from legacy code I would however like to focus on reusability.

Potentially platform-independent reusable parts can be made really reusable by not blindly using wchar_t directly, but instead e.g. a type Syschar conditionally defined as

enum Syschar: wchar_t {};    // For Windows, implying UTF-16

and as

enum Syschar: char {};       // For Linux-land, implying UTF-8

Use of enum instead of struct ensures that you can use the type to specialize std::basic_string (when you define the proper std::char_traits) even when its implementation uses a union for the short buffer optimization.
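A minimal sketch of how that can play out; Sysbase, the Base alias, and Sysstring are names invented for this example, and the traits simply delegate to those of the underlying character type:

#include <cstring>
#include <string>

#if defined(_WIN32)
enum Syschar: wchar_t {};    // For Windows, implying UTF-16
using Sysbase = wchar_t;
#else
enum Syschar: char {};       // For Linux-land, implying UTF-8
using Sysbase = char;
#endif

namespace std {
    template<> struct char_traits<Syschar>
    {
        using Base       = char_traits<Sysbase>;   // delegate the fiddly parts
        using char_type  = Syschar;
        using int_type   = Base::int_type;
        using off_type   = Base::off_type;
        using pos_type   = Base::pos_type;
        using state_type = Base::state_type;

        static void assign(char_type& r, const char_type& c) noexcept { r = c; }
        static bool eq(char_type a, char_type b) noexcept { return a == b; }
        static bool lt(char_type a, char_type b) noexcept { return a < b; }

        static int compare(const char_type* a, const char_type* b, size_t n)
        {
            for (size_t i = 0; i < n; ++i) {
                if (lt(a[i], b[i])) { return -1; }
                if (lt(b[i], a[i])) { return +1; }
            }
            return 0;
        }
        static size_t length(const char_type* s)
        {
            size_t n = 0;
            while (s[n] != char_type()) { ++n; }
            return n;
        }
        static const char_type* find(const char_type* s, size_t n, const char_type& c)
        {
            for (size_t i = 0; i < n; ++i) { if (eq(s[i], c)) { return s + i; } }
            return nullptr;
        }
        static char_type* move(char_type* d, const char_type* s, size_t n)
        { memmove(d, s, n*sizeof(char_type)); return d; }
        static char_type* copy(char_type* d, const char_type* s, size_t n)
        { memcpy(d, s, n*sizeof(char_type)); return d; }
        static char_type* assign(char_type* d, size_t n, char_type c)
        { for (size_t i = 0; i < n; ++i) { d[i] = c; } return d; }

        static char_type to_char_type(int_type i) noexcept { return char_type(Base::to_char_type(i)); }
        static int_type  to_int_type(char_type c) noexcept { return Base::to_int_type(Sysbase(c)); }
        static bool eq_int_type(int_type a, int_type b) noexcept { return Base::eq_int_type(a, b); }
        static int_type eof() noexcept { return Base::eof(); }
        static int_type not_eof(int_type i) noexcept { return Base::not_eof(i); }
    };
}   // namespace std

using Sysstring = std::basic_string<Syschar>;

Code that traffics in Sysstring then compiles unchanged on both platforms; only the conversion functions at the API boundary differ.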

As David Wheeler remarked, “All problems in computer science can be solved by another level of indirection” – and this is one of them.

0
votes

Is UTF-8 the most futureproof way to save data files?

There's really no reason to use anything else.

Windows system calls have to use wide character strings, otherwise they will be interpreted in a legacy code page.

You can also wrap Win32 API calls with shims that take UTF-8 strings and convert them before calling the UTF-16 native API.
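A minimal sketch of one such shim, assuming Windows; CreateFileU8 and to_utf16 are names invented here (the conversion is the same widen() idea sketched in an earlier answer), and error handling is omitted:

#include <string>
#include <windows.h>

// UTF-8 -> UTF-16 helper (error handling omitted for brevity).
static std::wstring to_utf16(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
    int n = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), nullptr, 0);
    std::wstring wide(n, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &wide[0], n);
    return wide;
}

// Shim: same shape as CreateFileW, but takes a UTF-8 path.
HANDLE CreateFileU8(const std::string& utf8Path, DWORD access, DWORD share,
                    SECURITY_ATTRIBUTES* security, DWORD disposition,
                    DWORD flags, HANDLE templateFile)
{
    return CreateFileW(to_utf16(utf8Path).c_str(), access, share,
                       security, disposition, flags, templateFile);
}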

Is it better to use wide character strings (compatible with system calls and MFC) or UTF-8 (compatible with data files if we go that way) for general strings within the program?

That really depends. You don't want to have to scatter conversions all over your code because that's more likely to lead to missed conversions.

If the program has complex internal logic then hopefully you've already organized it so that both the input/output code and the code that interacts with the system API are pretty localized, and you can choose either route: put conversions on the API usage or put conversions on I/O operations. If the system API usage and I/O aren't already localized then start by fixing that.

If the program's logic is simple enough that you don't need to localize one or the other then put conversions on whichever one is more localized. You can also refactor the program to make one or the other localized to ease the conversions.

How can we minimise the risk of UTF-8 strings being interpreted as being in legacy code pages? We have had cross-code page problems with overseas users in the past and getting away from this is one of our motives for moving to full Unicode.

Establish consistent standards and enforce them. Require that all non-wchar_t strings be UTF-8 and do not use any first- or third-party APIs that use legacy encodings. If your toolchain allows you to disable APIs (e.g., via a 'deprecated' attribute) then do that for APIs as you find and remove their usages. Ensure that the developers all understand string encodings, and make sure code reviewers watch for encoding mistakes.
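For instance, with MSVC the ANSI ('A') variants of Win32 calls can be fenced off as you retire them; #pragma deprecated makes any remaining use emit warning C4995, which /we4995 or /WX can turn into a hard error. The particular identifiers listed here are just examples:

#include <windows.h>

// MSVC-specific: any later use of these identifiers triggers warning C4995.
#pragma deprecated(CreateFileA, MessageBoxA, SetWindowTextA)

int main()
{
    // CreateFileA("x.txt", ...);   // would now warn: marked as #pragma deprecated
    MessageBoxW(nullptr, L"Unicode only", L"Demo", MB_OK);
    return 0;
}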