3
votes

Here is the program: http://codepad.org/eyxunHot
The encoding of the file is UTF-8.

I have a text file named "config.ini" with the following word in it: ➑ball

If I use notepad to save the file with "UTF-8" encoding, then run the program, according to the debugger the value of eight_ball is: âball

If I use notepad to save the file with "Unicode" encoding, then run the program, according to the debugger the value of eight_ball is: ÿþ'b

If I use notepad to save the file with "Unicode big endian" encoding, then run the program, according to the debugger the value of eight_ball is: þÿ'

In all these cases the result is incorrect. Also ANSI encoding doesn't support the ➑ symbol. How do I make sure that the word ➑ball will be extracted from the file when I go config_file >> eight_ball, regardless of encoding? I want the output of this program to be "Program is correct" regardless of the encoding of config.ini.

3
Note that your problem is fundamentally unsolvable. If I save a Latin-1 file with contents "âball" (7 valid characters, two of them invisible control characters), there is no way to distinguish that from a UTF-8 file containing ➑ball (5 valid characters). They're the same 7 bytes. – MSalters

3 Answers

1
votes

If you're on Windows and want to use INI files, keep in mind that the INI APIs support Unicode (UTF-16 little-endian) INI files without problems; you just have to create the file yourself, initially empty, with the BOM at the beginning.
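
A minimal sketch of that setup, under some assumptions: `create_utf16le_ini` and `write_eight_ball` are names invented for this example, as are the `settings` section and `eight_ball` key. The first function only writes the two BOM bytes (FF FE) and is portable; the second is compiled on Windows only and calls the wide-character INI API.

```cpp
#include <fstream>
#include <string>

#ifdef _WIN32
#include <windows.h>
#endif

// Create an empty file that starts with the UTF-16 little-endian BOM (FF FE).
// Returns true if the two bytes were written successfully.
bool create_utf16le_ini(const std::string& path) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    const char bom[2] = {'\xFF', '\xFE'};
    out.write(bom, sizeof bom);
    return static_cast<bool>(out);
}

#ifdef _WIN32
// Hypothetical helper: store one value through the wide-character INI API.
// The section and key names are made up for illustration.
// Pass a full path; a bare file name is looked up in the Windows directory.
bool write_eight_ball(const wchar_t* ini_path) {
    return WritePrivateProfileStringW(L"settings", L"eight_ball",
                                      L"\u2791ball", ini_path) != 0;
}
#endif
```

The idea is to call `create_utf16le_ini` once when the file does not exist yet, and afterwards read and write it only through `GetPrivateProfileStringW` / `WritePrivateProfileStringW`.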

By the way, if you want to work with C++ streams and Unicode, you may want to look at this article. Besides the UTF-8 part, you'll also learn how character conversion works under the hood in C++ streams.

1
votes

Maybe you can use the ICU library.

Windows has many problems with UTF support. My Ubuntu uses UTF-8 by default, which avoids this problem, but Unix-like systems handle it in a somewhat odd way in the C++ standard library: UTF-8 text is held in plain char* buffers, so a non-ASCII letter takes two or more array elements. With the string class it is cleaner to work with.
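
To illustrate the point about array elements versus letters, here is a small sketch. `utf8_code_points` is a hypothetical helper, not a standard library function, and it assumes its input is already valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-8 string by skipping continuation
// bytes (those of the form 10xxxxxx). Assumes the input is valid UTF-8.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;  // count only lead bytes and ASCII
    }
    return n;
}
```

For the word from the question, stored as the bytes `"\xE2\x9E\x91" "ball"`, `std::string::size()` reports 7 (bytes) while `utf8_code_points` reports 5, because ➑ alone occupies three bytes.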

0
votes

You need to set the locale before wstreams will work correctly. I would instead suggest using regular streams and a library for character conversion, since your input encoding will typically differ from the locale's encoding anyway. The best approach these days is to try reading as UTF-8 first and, if that fails, to fall back to CP1252 or some other user-configurable legacy charset.
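
A rough sketch of that "UTF-8 first, then legacy charset" idea, with the fallback hard-coded to CP1252 for brevity. All function names here (`is_valid_utf8`, `append_utf8`, `cp1252_to_utf8`, `to_utf8`) are made up for this example; a real program would more likely use ICU or iconv for the conversion. It is only a heuristic: bytes that happen to form valid UTF-8 are always taken as UTF-8, so the ambiguity MSalters describes in the comment remains.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Returns true if `s` is well-formed UTF-8 (no overlong forms, no surrogates).
bool is_valid_utf8(const std::string& s) {
    static const std::uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::size_t i = 0;
    const std::size_t n = s.size();
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        std::uint32_t cp = 0;
        if (c < 0x80) { ++i; continue; }                          // ASCII
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return false;                 // stray continuation or invalid lead byte
        if (i + len > n) return false;     // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += len;
    }
    return true;
}

// Append one code point from the Basic Multilingual Plane as UTF-8.
void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert CP1252 bytes to UTF-8. Bytes 0x80..0x9F use the table below;
// the five unassigned bytes are passed through as C1 control characters.
std::string cp1252_to_utf8(const std::string& s) {
    static const std::uint16_t high[32] = {
        0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
        0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178};
    std::string out;
    for (unsigned char c : s) {
        append_utf8(out, (c >= 0x80 && c <= 0x9F) ? high[c - 0x80] : c);
    }
    return out;
}

// Heuristic: keep the bytes if they are valid UTF-8, otherwise treat them as CP1252.
std::string to_utf8(const std::string& raw) {
    return is_valid_utf8(raw) ? raw : cp1252_to_utf8(raw);
}
```

The intended use is to read the file's bytes with an ordinary `std::ifstream` opened in binary mode, pass them through `to_utf8`, and compare against a UTF-8 literal. BOM handling for the two UTF-16 variants would have to be checked before this step and is not covered here.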