3
votes

I am trying to convert a string encoded in ISO-8859-1 to UTF-8 on Linux. I am using the iconv function to do that in C++. This is the code that I have:

//Conversion from ISO-8859-1 to UTF-8
iconv_t cd = iconv_open("UTF-8","ISO-8859-1");

char input[] = "\x80"; // "€": the byte value is 128 in ISO-8859-1
char *inputbuf = input;
size_t inputSize = 1;
size_t inputLeft = inputSize; // iconv decrements this, so inputSize stays intact

size_t outputSize = inputSize*4; // maximum size of a character in UTF-8 is 4
char *output = (char*)calloc(outputSize, 1); // zero-filled so unused bytes print as 0
char *outputbuf = output;
size_t outputLeft = outputSize; // iconv decrements this as well

//Conversion Function
iconv (cd, &inputbuf, &inputLeft, &outputbuf, &outputLeft);

//Display input bytes(ISO-8859-1)
cout << "input bytes(ISO-8859-1): ";
for (size_t i=0; i<inputSize; i++)
{
    cout << (int)(unsigned char) *(input+i) << ", "; // unsigned, so 128 is not shown as -128
}
cout<< std::endl;

//Display Converted bytes(UTF-8)
cout << "output bytes(UTF-8): ";
for (size_t i=0; i<outputSize; i++) //displaying all the 4 bytes allocated
{
    cout << (int)(unsigned char) *(output+i) << ", ";
}
cout<< std::endl;
iconv_close(cd);
free(output);

This is the output I observe:

input bytes(ISO-8859-1): 128
output bytes(UTF-8): 194, 128, 0, 0

As you can see, the converted UTF-8 bytes are 194, 128. However, the expected UTF-8 output is 226, 130, 172. I verified that none of the iconv functions reports an error.

Can anyone please help me figure out if I am missing anything here?

2
According to this table, the code 128 is undefined in the ISO 8859-1 code page. - Mr.C64
€ is NOT byte 128 (0x80) in ISO-8859-1. In fact, byte 0x80 is unassigned in ISO-8859-1. You are thinking of Windows-1252 (or another similar charset), which does have € in byte 0x80 (it is not always 0x80 in all supporting charsets, though). Windows-1252 is commonly mistaken for ISO-8859-1. - Remy Lebeau
@YSC: ISO-8859-15 encodes € as byte 164 (0xA4). - Remy Lebeau
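Following the comments above, here is a minimal sketch of the same conversion with the source charset passed in as a parameter, so that Windows-1252 can be tried in place of ISO-8859-1. The helper name to_utf8 is made up for illustration, and "CP1252" is assumed to be a charset name your iconv implementation accepts for Windows-1252 (glibc and GNU libiconv both list it).

```cpp
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from the question): converts `in` from the
// charset named `from` to UTF-8 using iconv, throwing on any failure.
std::string to_utf8(const std::string& in, const char* from)
{
    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    // 4 output bytes per input byte is a safe upper bound for UTF-8.
    std::string out(in.size() * 4, '\0');
    char* src = const_cast<char*>(in.data()); // iconv does not modify the input
    size_t srcLeft = in.size();
    char* dst = &out[0];
    size_t dstLeft = out.size();

    size_t rc = iconv(cd, &src, &srcLeft, &dst, &dstLeft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        throw std::runtime_error("iconv failed");

    out.resize(out.size() - dstLeft); // keep only the bytes iconv wrote
    return out;
}
```

With this sketch, to_utf8("\x80", "CP1252") should give the three bytes 226, 130, 172 that the question expects, while to_utf8("\x80", "ISO-8859-1") gives 194, 128 as observed.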

2 Answers

0
votes

You can use either the utfcpp library (http://utfcpp.sourceforge.net/) or Boost.Locale for that purpose.

-1
votes

This is not a bug in iconv: 0xC2 0x80 is the valid UTF-8 sequence for code point U+0080, a control character (<control>).

Byte 0x80 is often mistaken for EURO SIGN, code point U+20AC, which is encoded as 0xE2 0x82 0xAC in UTF-8.
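To see where 194, 128 comes from without iconv, here is a small hand-rolled sketch. It relies on the fact that ISO-8859-1 maps each byte value to the Unicode code point with the same number, and then applies the standard UTF-8 bit layout. The function names encode_utf8 and latin1_to_utf8 are made up for this illustration, and the encoder does not validate its input (surrogates and values above U+10FFFF are not rejected).

```cpp
#include <string>

// Illustrative encoder (hypothetical name): writes one code point as UTF-8.
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {                 // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// ISO-8859-1 byte N is code point U+00NN, so each byte is encoded directly.
std::string latin1_to_utf8(const std::string& in)
{
    std::string out;
    for (unsigned char byte : in)
        out += encode_utf8(byte);
    return out;
}
```

Here latin1_to_utf8("\x80") yields 0xC2 0x80 (194, 128), matching the iconv output in the question, whereas encode_utf8(0x20AC) yields 0xE2 0x82 0xAC (226, 130, 172), the euro sign.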