Extract correct text from a wifstream regardless of encoding

Question

Here is the program: http://codepad.org/eyxunHot
The encoding of the file is UTF-8.

I have a text file named "config.ini" with the following word in it: ➑ball

If I use notepad to save the file with "UTF-8" encoding, then run the program, according to the debugger the value of eight_ball is: ï»¿âball

If I use notepad to save the file with "Unicode" encoding, then run the program, according to the debugger the value of eight_ball is: ÿþ'b

If I use notepad to save the file with "Unicode big endian" encoding, then run the program, according to the debugger the value of eight_ball is: þÿ'

In all these cases the result is incorrect. Also ANSI encoding doesn't support the ➑ symbol. How do I make sure that the word ➑ball will be extracted from the file when I go config_file >> eight_ball, regardless of encoding? I want the output of this program to be "Program is correct" regardless of the encoding of config.ini.

Note that your problem is fundamentally unsolvable. If I save a Latin-1 file with contents "ï»¿âball" (8 valid characters), there is no way to distinguish that from an UTF-8 file containing ➑ball (5 valid characters). They're the same 8 bytes. — MSalters
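The general case is ambiguous, as noted above, but the three Notepad encodings in the question each start with a distinct byte order mark: EF BB BF for "UTF-8", FF FE for "Unicode", and FE FF for "Unicode big endian". Here is a minimal sketch of sniffing that BOM from the raw bytes of the file. The names `Bom` and `detect_bom` are made up for illustration, and the sketch ignores UTF-32 and files saved without a BOM.

```cpp
#include <cstddef>
#include <string>

// BOMs that Notepad writes for the three encodings mentioned in the question.
enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Inspect the first bytes of a file's raw contents (read in binary mode)
// and report which BOM, if any, they form.
Bom detect_bom(const std::string& bytes) {
    // Read a byte as unsigned so values above 0x7F compare correctly.
    auto at = [&](std::size_t i) { return static_cast<unsigned char>(bytes[i]); };

    if (bytes.size() >= 3 && at(0) == 0xEF && at(1) == 0xBB && at(2) == 0xBF)
        return Bom::Utf8;
    if (bytes.size() >= 2 && at(0) == 0xFF && at(1) == 0xFE)
        return Bom::Utf16LE;
    if (bytes.size() >= 2 && at(0) == 0xFE && at(1) == 0xFF)
        return Bom::Utf16BE;
    return Bom::None;  // no BOM: the encoding has to be known some other way
}
```

Once the BOM is known, the rest of the file can be decoded with the matching conversion. A result of `Bom::None` is exactly the ambiguous case described in the comment above.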

Matteo Italia · Accepted Answer · 2010-02-14T10:47:58

If you're on Windows and want to use INI files, keep in mind that the INI APIs support Unicode (UTF-16 little-endian) INI files without problems; you just have to create the empty file yourself, with the BOM at the beginning.
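As a rough sketch of the "empty file with the BOM" step, the following writes a file that contains only the two bytes FF FE, the UTF-16 little-endian BOM. The function name `create_utf16le_bom_file` is hypothetical, and the code uses only standard C++ rather than any Windows call.

```cpp
#include <fstream>
#include <string>

// Create (or truncate) a file that contains only the UTF-16 little-endian
// BOM, i.e. the bytes FF FE. Returns true if both bytes were written.
bool create_utf16le_bom_file(const std::string& path) {
    // Binary mode so that no newline or character translation is applied.
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;

    const char bom[] = {'\xFF', '\xFE'};
    out.write(bom, sizeof bom);
    out.flush();
    return static_cast<bool>(out);
}
```

On Windows, the resulting file could then be handed to the wide-character INI functions such as `WritePrivateProfileStringW` and `GetPrivateProfileStringW`, preferably with a full path, since those functions look in the Windows directory when given a bare file name.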

By the way, if you want to work with C++ streams and Unicode, you may want to look at this article. Besides the UTF-8 material, you'll also learn how character conversion works under the hood in C++ streams.
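To give an idea of what "character conversion under the hood" means, here is a minimal sketch that imbues a `std::wifstream` with a UTF-8 conversion facet before reading. It is not necessarily what the article does: it assumes the standard `std::codecvt_utf8` facet from `<codecvt>`, which was added in C++11 and deprecated in C++17, and the function name `read_first_word_utf8` is made up for illustration. It only handles the UTF-8 case.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

// Read the first whitespace-delimited word from a UTF-8 encoded file
// into a wide string. A leading UTF-8 BOM, if present, is skipped.
std::wstring read_first_word_utf8(const std::string& path) {
    std::wifstream in;
    // Install the conversion facet before opening the file, so every byte
    // read from disk goes through UTF-8 decoding. std::consume_header
    // tells the facet to swallow a leading BOM.
    in.imbue(std::locale(in.getloc(),
        new std::codecvt_utf8<wchar_t, 0x10FFFF, std::consume_header>));
    in.open(path, std::ios::binary);

    std::wstring word;
    in >> word;  // stays empty if the file could not be opened or read
    return word;
}
```

The stream's file buffer calls the imbued `codecvt` facet to turn external bytes into internal `wchar_t` values, which is why swapping the facet changes how the same file is interpreted.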

