Visual C++: Migrating traditional C and C++ string code to a Unicode world

votes

I see that Visual Studio 2008 and later now start off a new solution with the Character Set set to Unicode. My old C++ code deals with only English ASCII text and is full of:

Literal strings like "Hello World"
char type
char * pointers to allocated C strings
STL string type
Conversions from STL string to C string and vice versa using STL string constructor (which accepts const char *) and STL string.c_str()
1. What are the changes I need to make to migrate this code so that it works in an ecosystem of Visual Studio Unicode and Unicode enabled libraries? (I have no real need for it to work with both ASCII and Unicode, it can be pure Unicode.)
2. Is it also possible to do this in a platform independent way? (i.e., by not using Microsoft types.)

I see so many wide character and Unicode types and conversions scattered around, hence my confusion. (Ex: wchar_t, TCHAR, _T, _TEXT, TEXT etc.)

c++ c unicode string

Take a look at this article - joelonsoftware.com/articles/Unicode.html - some good background knowledge on Unicode in there. – AAT

6 Answers

votes

I recommend very much against L"", _T(), std::wstring (the latter is not multiplatform) and Microsoft recommendations on how to do Unicode.

There's a lot of confusion on this subject. Some people still think Unicode == 2 byte characters == UTF-16. Neither equality is correct.

In fact, it's possible, and even better to stay with char* and the plain std::string, plain literals and change very little (and still fully support Unicode!).

See my answer here: https://stackguides.com/questions/1049947/should-utf-16-be-considered-harmful/1855375#1855375 for how to do it the easiest (in my opinion) way.

votes

Note: Wow... Apparently, SOMEONE decided that ALMOST all answers deserved a downvote, even the correct ones... I took it upon myself to upvote them to balance that out...

Let's see if I have my own downmod... :-/

Edit : REJOICE!!!

Nine hours ago, someone (probably the one who downvoted every answer but Pavel Radzivilovsky's one) downvoted this answer. Of course, without any comment pointing to what's wrong with my answer.

\o/

1 - How to migrate on Windows Unicode?

What are the changes I need to make to migrate this code so that it works in an ecosystem of Visual Studio Unicode and Unicode enabled libraries? (I have no real need for it to work with both ASCII and Unicode, it can be pure Unicode.)

1.a - My codebase is large, I can't do it in one step!

Let's imagine you want to do it gradually (because your app is not small).

I had the same problem in my team: I wanted to produce Unicode ready code coexisting with code that was not Unicode ready.

For this, you must use Microsoft's tchar.h header and its facilities. Using your own examples:

"Hello World" ----> _T("Hello World")
char type ----> TCHAR type
char * pointers to allocated C strings ----> TCHAR * pointers
std::string type ---> This is tricky because you must create your own std::tstring
remember that sizeof(char) can be different from sizeof(TCHAR), so update your mallocs and new[], too

1.b - Your own `tstring.hpp` header

To handle the STL with my compiler (at that time, I was working on Visual C++ 2003, so your mileage may vary), I had to provide a tstring.hpp header, which is cross-platform and enables the user to use tstring, tiostream, etc. I can't put the complete source here, but I will give an extract that will enable you to produce your own:

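#include <istream>
#include <string>

namespace std
{

#ifdef _MSC_VER

#ifdef UNICODE
// Unicode build: the t-prefixed names map to the wide-character types
typedef             wstring                         tstring ;
typedef             wistream                        tistream ;
// etc.
#else // Not UNICODE
// ANSI build: the t-prefixed names map to the narrow-character types
typedef             string                          tstring ;
typedef             istream                         tistream ;
// etc.
#endif

#endif

} // namespace std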

Normally, you are not allowed to add your own declarations to the std namespace (the standard makes it undefined behavior), but I guess this is OK here (and it worked when tested).

This way, you can prefix most STL/C++ iostreams constructs with t and have them Unicode ready (on Windows).

1.c - It's done!!!

Now, you can switch from ANSI mode to UNICODE mode by defining the UNICODE and _UNICODE defines, usually in the project settings (I remember on Visual C++ 2008 that there are entries in the first settings pages exactly for that).

My advice is, as you probably have a "Debug" and a "Release" configuration in your Visual C++ project, to create "Debug Unicode" and "Release Unicode" configurations derived from them, where the macros described above are defined.

Thus, you'll be able to produce ANSI and UNICODE binaries.

1.d - Now, everything is (or should be) Unicode!

If you want your app to be cross-platform, ignore this section.

Whether you can modify your whole codebase in one step, or you have already converted it to use the tchar.h features described above, you can now remove all the macros from your code:

_T("Hello World") ----> L"Hello World"
TCHAR type ----> wchar_t type
TCHAR * pointers to allocated C strings ----> wchar_t * pointers
std::tstring type ---> std::wstring type, etc.

1.e - Remember UTF-16 code points can be 1 or 2 wchar_t wide on Windows!

One common misconception on Windows is to believe that one wchar_t is one Unicode character. This is wrong: code points outside the Basic Multilingual Plane (BMP) are represented by two wchar_t (a surrogate pair).

So, any code that relies on one wchar_t being one character will potentially break if you use Unicode characters outside the BMP.
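As a rough illustration of the point above, here is a minimal sketch that counts code points rather than code units in a UTF-16 string. It uses std::u16string instead of std::wstring so that it behaves the same on every platform; the function name count_code_points_utf16 is made up for this example, and unpaired surrogates are simply counted as one code point each.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-16 string (hypothetical helper).
// A high surrogate (0xD800-0xDBFF) followed by a low surrogate
// (0xDC00-0xDFFF) forms ONE code point stored in TWO code units.
std::size_t count_code_points_utf16(const std::u16string& text)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const char16_t unit = text[i];
        const bool is_high = unit >= 0xD800 && unit <= 0xDBFF;
        if (is_high && i + 1 < text.size()) {
            const char16_t next = text[i + 1];
            if (next >= 0xDC00 && next <= 0xDFFF) {
                ++i; // skip the low surrogate: it belongs to the same code point
            }
        }
        ++count;
    }
    return count;
}
```

For a string such as u"a\U0001F600b", size() reports four code units while the function reports three code points.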

2 - Doing it cross platform?

Is it also possible to do this in a platform independent way? (i.e., by not using Microsoft types.)

Now, this was the tricky part.

Linux (I don't know about other OSes, but it should be easy to infer from either the Linux or the Windows solution) is now Unicode ready, with char strings expected to contain UTF-8.

This means that your app, once compiled, for example, on my Ubuntu 10.04, is by default Unicode.

2.a - Remember UTF-8 code points can be 1, 2, 3 or 4 char wide on Linux!

Of course, the advice above on UTF-16 and wide chars is even more critical here:

A Unicode code point can need from 1 to 4 char values to be represented. So any code you use that relies on the assumption that every char is an independent Unicode character will break.
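To make the difference between char count and character count concrete, here is a small sketch that counts code points in a UTF-8 std::string. It assumes the input is already valid UTF-8 and does no validation; the name count_code_points_utf8 is invented for this example.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-8 string (hypothetical helper).
// Assumption: the input is valid UTF-8. Continuation bytes have the bit
// pattern 10xxxxxx, so every byte that is NOT a continuation byte starts
// a new code point.
std::size_t count_code_points_utf8(const std::string& text)
{
    std::size_t count = 0;
    for (unsigned char byte : text) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

With "h\xC3\xA9llo" (the word "héllo"), size() is 6 but the function returns 5.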

2.b - There is no `tchar.h` on Linux!

My solution: Write it.

You only need to define the 't' prefixed symbols to map over the normal ones, as shown in this extract:

#ifdef __GNUC__

#ifdef  __cplusplus
extern "C" {
#endif

#define _TEOF       EOF

#define __T(x)      x

// etc.
#define _tmain      main

// etc.

#define _tprintf    printf
#define _ftprintf   fprintf

// etc.

#define _T(x)       __T(x)
#define _TEXT(x)    __T(x)

#ifdef  __cplusplus
}
#endif

#endif // __GNUC__

... and include it on Linux instead of including the tchar.h from Windows.

2.c - There is no `tstring` on Linux!

Of course, the STL mapping done above for Windows should be completed to handle the Linux case:

namespace std
{

#ifdef _MSC_VER

#ifdef UNICODE
typedef             wstring                         tstring ;
typedef             wistream                        tistream ;
// etc.
#else // Not UNICODE
typedef             string                          tstring ;
typedef             istream                         tistream ;
// etc.
#endif

#elif defined(__GNUC__)
typedef             string                          tstring ;
typedef             istream                         tistream ;
// etc.
#endif

} // namespace std

Now, you can use _T("Hello World") and std::tstring on Linux as well as Windows.
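For readers who would rather not add names to namespace std, the following is one possible variant of the same idea, not the header described above. It is a sketch under a few assumptions: the namespace name compat and the function make_greeting are invented for illustration, and on non-Microsoft compilers it defines only TCHAR and _T, which is far less than a real tchar.h replacement would need.

```cpp
#include <string>

#ifdef _MSC_VER
#include <tchar.h>          // Microsoft's TCHAR, _T(), etc.
#else
typedef char TCHAR;         // fallback: plain char, expected to hold UTF-8
#define _T(x) x             // fallback: leave the literal untouched
#endif

namespace compat            // hypothetical namespace, used instead of std
{
    // Resolves to std::string or std::wstring depending on TCHAR.
    typedef std::basic_string<TCHAR> tstring;
}

// Example use: the same source line compiles in every configuration.
compat::tstring make_greeting()
{
    return compat::tstring(_T("Hello World"));
}
```

Because tstring is built from TCHAR directly, there is no need for a separate UNICODE check in the typedef itself.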

3 - There must be a catch!

And there is.

First, there is the problem of polluting the std namespace with your own t-prefixed symbols, which is supposed to be forbidden. Then, don't forget the added macros, which will pollute your code. In the current case, I guess this is OK.

Second, I assumed you were using MSVC on Windows (thus the macro _MSC_VER) and GCC on Linux (thus the macro __GNUC__). Modify the defines if your case is different.

Third, your code must be Unicode neutral, that is, you must not rely on your strings being either UTF-8 or UTF-16. In fact, your source should contain nothing but ASCII chars to remain cross-platform compatible.

This means that some features, like searching for the presence of ONE Unicode character, must be done by a separate piece of code, which will have all the #define needed to make it right.

For example, searching for the character é (Unicode code point 233, U+00E9) would require searching for the first wchar_t with value 233 when using UTF-16 on Windows, and for the first sequence of the two char values 195 and 169 when using UTF-8. This means you must either use some Unicode library to do it, or write it yourself.

But this is more an issue of Unicode itself than Unicode on Windows or on Linux.
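The é example can be sketched as follows. This is only an illustration: the two function names are invented, std::u16string stands in for the UTF-16 wchar_t strings used on Windows, and the search looks only for the precomposed character U+00E9 (it will not find an "e" followed by a combining accent).

```cpp
#include <cstddef>
#include <string>

// UTF-8: é (U+00E9) is the two-byte sequence 0xC3 0xA9 (195, 169).
// Returns the byte index of the first match, or std::string::npos.
std::size_t find_e_acute_utf8(const std::string& text)
{
    return text.find("\xC3\xA9");
}

// UTF-16: é (U+00E9) is the single code unit 0x00E9 (233).
// Returns the code unit index of the first match, or std::u16string::npos.
std::size_t find_e_acute_utf16(const std::u16string& text)
{
    return text.find(u'\u00E9');
}
```

Note that the returned indexes count code units, so the same text can give different positions in the two encodings: for "naïve é" the UTF-8 search returns 7 and the UTF-16 search returns 6.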

3.a - But Windows is supposed to not handle UTF-16 correctly

So what?

The "canonical" example I saw described was the EDIT Win32 control, which is supposed to be unable to correctly backspace over a non-BMP UTF-16 character on Windows (note that I did not verify the bug, I just don't care enough).

This is a Microsoft issue. Nothing you decide in your code will change whether this bug exists in the Win32 API. So using UTF-8 chars on Windows won't correct the bug in the EDIT control. The only thing you can hope to do is to create your own EDIT control (subclass it and handle the BACKSPACE event correctly?) or your own conversion functions.

Don't mix two different problems, that is: a supposed bug in the Windows API and your own code. Nothing in your own code will avoid the bug in the Windows API unless you do NOT use the supposedly bugged Windows API.

3.b - But UTF-16 on Windows, UTF-8 on Linux, isn't that complicated?

Yes, it could lead to bugs on some platform that won't happen on others, if you assume too much about characters.

I assumed your primary platform was Windows (or that you wanted to provide a library for both wchar_t and char users).

But if this is not the case, if Windows is not your primary platform, then there is the solution of assuming all your char and std::string will contain UTF-8 characters, unless told otherwise. You'll need, then, to wrap APIs to make sure that your char UTF-8 string will not be mistaken for an ANSI (or other code page) char string on Windows. For example, file names passed to the stdio.h and iostream libraries will be assumed to be in the current code page, as will strings passed to the ANSI versions of the Win32 API functions (CreateWindowA, for example).

This is the approach of GTK+, which uses UTF-8 characters, but not, surprisingly, of Qt (upon which Linux KDE is built), which uses UTF-16.

Source:

Still, it won't protect you from the "Hey, but Win32 edit controls don't handle my unicode characters!" problem, so you'll still have to subclass that control to have the desired behaviour (if the bug still exists)...
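If you go the UTF-8-everywhere route, the wrappers mentioned above need a UTF-8 to UTF-16 conversion at the API boundary. On Windows the usual way is the Win32 function MultiByteToWideChar with the CP_UTF8 code page; the portable sketch below only shows what such a conversion does. The name utf8_to_utf16 is invented for this example, and the validation is deliberately minimal (it does not reject overlong encodings).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Converts UTF-8 to UTF-16 (hypothetical helper, minimal validation).
std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;          // decoded code point
        std::size_t extra = 0;    // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point");
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));   // one code unit
        } else {
            cp -= 0x10000;                              // surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

A wrapper around a wide Win32 function would call something like this (or MultiByteToWideChar) on each UTF-8 argument before passing it on.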

Appendix

See my answer at std::wstring VS std::string for a complete comparison of std::string and std::wstring.

votes

"Hello World" -> L"Hello World"

char -> wchar_t (unless you actually want char)

char * -> wchar_t *

string -> wstring

These are all platform independent. However, be aware that a wide character may have a different size on different platforms (two bytes on Windows, four bytes on others).

Define UNICODE and _UNICODE in your project (in Visual Studio you can do this by setting the project to use Unicode in the settings). This also makes the _T, _TEXT and TEXT macros add the L prefix automatically, and makes TCHAR a wchar_t. These are Microsoft specific, so avoid them if you want to be cross-platform.
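The size difference matters as soon as a character lies outside the BMP. The sketch below appends one code point to a std::wstring and shows that the number of wchar_t elements it takes depends on the platform. The function name append_code_point is invented for this example, and it assumes the code point passed in is valid.

```cpp
#include <string>

// Appends one Unicode code point to a wide string (hypothetical helper).
// Assumption: cp is a valid code point (not a surrogate, <= 0x10FFFF).
void append_code_point(std::wstring& out, char32_t cp)
{
    if (sizeof(wchar_t) >= 4 || cp < 0x10000) {
        // 4-byte wchar_t (UTF-32), or a BMP character: one element is enough.
        out.push_back(static_cast<wchar_t>(cp));
    } else {
        // 2-byte wchar_t (UTF-16) and a non-BMP character: surrogate pair.
        cp -= 0x10000;
        out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
    }
}
```

Appending U+1F600 grows the string by one element where wchar_t is four bytes and by two where it is two bytes, so size() is not portable as a character count.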

votes

I would suggest not worrying about supporting both ASCII and Unicode builds (à la TCHAR) and going straight to Unicode. That way you get to use more of the platform-independent functions (wcscpy, wcsstr, etc.) instead of relying on the TCHAR functions, which are Microsoft specific.

You can use std::wstring instead of std::string and replace all chars with wchar_ts. With a massive change like this I found that you start with one thing and let the compiler guide you to the next.

One thing that I can think of that might not be obvious until run time is where a string is allocated with malloc without using the sizeof operator for the underlying type. So watch out for things like char * p = (char*)malloc(11) (10 characters plus the terminating null character): once the elements are wchar_t, this buffer will be too small (half the required size on Windows, where wchar_t is two bytes). It should become wchar_t * p = (wchar_t*)malloc(11*sizeof(wchar_t)).
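Here is a small sketch of the allocation pattern just described, with the element size included in the byte count. The helper name duplicate_wide is invented for this example.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cwchar>

// Returns a malloc'ed copy of a wide string (hypothetical helper).
// The caller must release it with free(). Returns nullptr if malloc fails.
wchar_t* duplicate_wide(const wchar_t* text)
{
    const std::size_t length = std::wcslen(text);
    // Byte count = number of elements (including the terminator)
    // multiplied by the size of one element.
    wchar_t* copy = static_cast<wchar_t*>(
        std::malloc((length + 1) * sizeof(wchar_t)));
    if (copy != nullptr) {
        std::wmemcpy(copy, text, length + 1); // copies the terminator too
    }
    return copy;
}
```

Writing the size as (count) * sizeof(element) keeps the code correct if the element type changes again later.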

Oh and the whole TCHAR is to support compile time ASCII/Unicode strings. It's defined something like this:

#ifdef _UNICODE
#define _T(x) L ## x
#else
#define _T(x) x
#endif

So that in unicode configuration _T("blah") becomes L"blah" and in ascii configuration it's "blah".

votes

Your question involves two different but related concepts. One of them is the encoding of the string (Unicode/ASCII, for example). The other is the data type to be used for the character representation.

Technically, you can have a Unicode application using plain char and std::string. You could use hexadecimal ("\xD7\xBA") or octal ("\327\272") escapes in literals to specify the UTF-8 byte sequence of a character (here U+05FA). Notice that with this approach your existing string literals that contain only ASCII characters remain valid, since UTF-8 encodes ASCII characters with the same byte values.

One important point to observe is that many string related functions would need to be used carefully. This is because they'll be operating on bytes rather than characters. For example, std::string::operator[] might give you a particular byte that is only part of an Unicode character.
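As an example of the care needed when working on bytes, the sketch below shortens a UTF-8 std::string to a byte limit without cutting a multi-byte character in half. It assumes valid UTF-8 input, and the name truncate_utf8 is invented for this example.

```cpp
#include <cstddef>
#include <string>

// Truncates a UTF-8 string to at most max_bytes bytes without splitting
// a multi-byte sequence (hypothetical helper, assumes valid UTF-8).
std::string truncate_utf8(const std::string& text, std::size_t max_bytes)
{
    if (text.size() <= max_bytes) {
        return text;
    }
    std::size_t end = max_bytes;
    // If the first byte to be dropped is a continuation byte (10xxxxxx),
    // the cut falls inside a character: back up to its lead byte.
    while (end > 0 &&
           (static_cast<unsigned char>(text[end]) & 0xC0) == 0x80) {
        --end;
    }
    return text.substr(0, end);
}
```

A plain substr(0, 4) on "caf\xC3\xA9" would keep the byte 0xC3 without its continuation byte, while this version returns "caf".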

In Visual Studio, wchar_t was chosen as the underlying character type. So if you're working with Microsoft-based libraries, things should get easier for you if you follow much of the advice posted by others here: replacing char with wchar_t, using the "T" macros (if you want to preserve transparency between Unicode/non-Unicode), etc.

However, I don't think there is a de facto standard of working with Unicode across libraries, since they might have different strategies to handle it.

-4

votes

Surround your literal constants with _T(), e.g. _T("Hello world")
Replace char with the TCHAR macro
Replace string with wstring

Then all should work.
