6
votes

I am modernizing a large, legacy MFC codebase which contains a veritable medley of string types:

  • CString
  • std::string
  • std::wstring
  • char*
  • wchar_t*
  • _bstr_t

I'd like to standardize on a single string type internally, and convert to the other types only when absolutely required by a third-party API (e.g. COM or MFC functions). The question my coworkers and I are debating: which string type should we standardize on?

I would prefer one of the C++ standard strings: std::string or std::wstring. I'm personally leaning toward std::string, because we do not have any need for wide characters - it is an internal codebase with no customer-facing UI (i.e. no need for multiple-language support). "Plain" strings allow us to use simple, unadorned string literals ("Hello world" vs L"Hello world" or _T("Hello world")).
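
To make that concrete, here is a rough sketch of the kind of boundary helpers I have in mind (ToCString/FromCString are made-up names, and this assumes our std::strings hold text in the current ANSI code page):

    // Hypothetical boundary helpers: std::string everywhere internally,
    // converted to/from CString only where MFC requires it.
    #include <atlstr.h>    // CString
    #include <atlconv.h>   // CT2A conversion class
    #include <string>

    CString ToCString(const std::string& s)   { return CString(s.c_str()); }   // char -> TCHAR
    std::string FromCString(const CString& s) { return std::string(CT2A(s)); } // TCHAR -> char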

Is there an official stance from the programming community? When faced with multiple string types, what is typically used as the standard 'internal' storage format?

Windows internally is UTF-16LE, so std::wstring is a good fit for that platform; so is std::vector<wchar_t>. – Richard Critten
For a Windows application, use std::wstring. With narrow strings you'd need conversions all over the place. Note: since you don't already know this, you're not a good choice of person to do the job; it's basics. That choice is your manager's fault. – Cheers and hth. - Alf
Re _T("Hello world"), the T macros were obsoleted in the year 2000 by the introduction of Layer for Unicode, and today our tools can't produce executables for the Windows versions (9x) that these macros target. I understand it's a legacy codebase. But when your task is to clean it up, mentioning T macros as convenient is absurd and very counter-productive.Cheers and hth. - Alf
If you choose narrow chars, then all you need to break your program is one employee with a non-Latin name, and you hit encoding problems for that user's directory and everything below it. – Richard Critten
@BTownTKD: your statement "Windows provides narrow-char alternatives for nearly all APIs" is based on full ignorance. The narrow functions do a conversion to/from Windows ANSI, which is (1) system specific, and (2) unable to represent e.g. all filesystem paths. Also, many APIs, especially newer ones, have no ANSI wrappers. – Cheers and hth. - Alf

2 Answers

7
votes

If we are talking about Windows, then I'd use std::wstring (because we usually need proper string operations, not just raw pointers), or wchar_t* if you just pass strings around.

Note that Microsoft recommends this here: Working with Strings

Windows natively supports Unicode strings for UI elements, file names, and so forth. Unicode is the preferred character encoding, because it supports all character sets and languages. Windows represents Unicode characters using UTF-16 encoding, in which each character is encoded as a 16-bit value. UTF-16 characters are called wide characters, to distinguish them from 8-bit ANSI characters. The Visual C++ compiler supports the built-in data type wchar_t for wide characters.

Also:

When Microsoft introduced Unicode support to Windows, it eased the transition by providing two parallel sets of APIs, one for ANSI strings and the other for Unicode strings. [...] Internally, the ANSI version translates the string to Unicode.

Also:

New applications should always call the Unicode versions. Many world languages require Unicode. If you use ANSI strings, it will be impossible to localize your application. The ANSI versions are also less efficient, because the operating system must convert the ANSI strings to Unicode at run time. [...] Most newer APIs in Windows have just a Unicode version, with no corresponding ANSI version.
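
As a minimal sketch of what that means in practice, assuming a Unicode build (MessageBoxW stands in for any wide API here):

    // std::wstring feeds the W ("wide") API versions directly,
    // so Windows performs no run-time string conversion.
    #include <windows.h>
    #include <string>

    int main() {
        std::wstring message = L"Hello, \u4E16\u754C";            // any Unicode text
        ::MessageBoxW(nullptr, message.c_str(), L"Demo", MB_OK);  // explicit W version
        // MessageBoxA would force an ANSI -> UTF-16 conversion at run time.
        return 0;
    }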

1
vote

It depends.

When programming for Windows, I recommend using std::wstring at least for:

  • Resources (Strings, Dialogs, etc.)
  • Filesystem access (Windows allows non-ASCII characters in file and directory names - including all the "wrong kinds of apostrophes", by the way - and such names are impossible to open through the ANSI API)
  • COM (a BSTR is always wide character; see the sketch just after this list)
  • Other user-facing interfaces (clipboard, system error reporting, etc)
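
A minimal sketch of the COM point from the list above (ToBstr/FromBstr are hypothetical helper names):

    // A BSTR is always UTF-16, so std::wstring round-trips without
    // re-encoding; _bstr_t manages the SysAllocString/SysFreeString
    // lifetime automatically.
    #include <comutil.h>   // _bstr_t; link with comsuppw.lib
    #include <string>

    _bstr_t ToBstr(const std::wstring& s) { return _bstr_t(s.c_str()); }

    std::wstring FromBstr(const _bstr_t& b) {
        const wchar_t* p = static_cast<const wchar_t*>(b);
        return p ? std::wstring(p) : std::wstring();  // a null BSTR means "empty"
    }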

However, internal ASCII data files and UTF-8-encoded data are easier to handle using narrow (char-based) strings. It's fast, efficient and straightforward.
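
For example, the conversion to wide characters at an API boundary can live in a single helper (Utf8ToWide is a hypothetical name):

    // Converts an internal UTF-8 std::string to the std::wstring that
    // the wide Windows APIs expect, via MultiByteToWideChar.
    #include <windows.h>
    #include <string>

    std::wstring Utf8ToWide(const std::string& utf8) {
        if (utf8.empty()) return std::wstring();
        const int len = ::MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                              static_cast<int>(utf8.size()),
                                              nullptr, 0);
        std::wstring wide(len, L'\0');
        ::MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                              static_cast<int>(utf8.size()), &wide[0], len);
        return wide;
    }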

There may also be other aspects that are not mentioned in the question - databases, external APIs, input/output files and their character sets, and so on - all of which play a role when deciding on the best data structures for the job.

"UTF-8 everywhere" is a sound idea in general. But there is 0 Windows API that takes UTF-8. Even the std::experimental::filesystem API uses std::wstring on Windows and std::string on POSIX.