I am trying to convert a string encoded in ISO-8859-1 to UTF-8 on Linux. I am using the iconv function to do that in C++. This is the code that I have:
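// Conversion from ISO-8859-1 to UTF-8
#include <iconv.h>
#include <cstdlib>
#include <iostream>
using std::cout;

int main()
{
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    char input[] = "\x80"; // "€": the byte value is 128 in ISO-8859-1
    char *inputbuf = input;
    const size_t inputSize = 1;
    size_t inputLeft = inputSize; // iconv decrements this as it consumes input
    const size_t outputSize = inputSize * 4; // maximum size of a character in UTF-8 is 4 bytes
    char *output = (char*)calloc(outputSize, 1); // zero-filled so unused bytes print as 0
    char *outputbuf = output;
    size_t outputLeft = outputSize; // iconv decrements this as it writes output
    // Conversion function
    iconv(cd, &inputbuf, &inputLeft, &outputbuf, &outputLeft);
    // Display input bytes (ISO-8859-1)
    cout << "input bytes(ISO-8859-1): ";
    for (size_t i = 0; i < inputSize; i++)
    {
        cout << (int)(unsigned char)input[i] << ", ";
    }
    cout << std::endl;
    // Display converted bytes (UTF-8)
    cout << "output bytes(UTF-8): ";
    for (size_t i = 0; i < outputSize; i++) // displaying all 4 bytes allocated
    {
        cout << (int)(unsigned char)output[i] << ", ";
    }
    cout << std::endl;
    iconv_close(cd);
    free(output);
    return 0;
}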
This is the output I observe:
input bytes(ISO-8859-1): 128
output bytes(UTF-8): 194, 128, 0, 0
As you can see, the converted UTF-8 output bytes are 194, 128. However, the expected UTF-8 output is 226, 130, 172. I verified that none of the iconv functions reported an error.
Can anyone please help me figure out if I am missing anything here?
€ is NOT byte 128 (0x80) in ISO-8859-1. In fact, byte 0x80 is unassigned in ISO-8859-1. You are thinking of Windows-1252 (or other similar charset), which does have € in byte 0x80 (it is not always 0x80 in all supporting charsets, though). Windows-1252 is commonly mistaken for ISO-8859-1. – Remy Lebeau
€ as byte 164 (0xA4). – Remy Lebeau
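To illustrate the point made in the comments, here is a minimal sketch of the same conversion with the source charset passed in as a parameter, so that byte 0x80 can be converted once as ISO-8859-1 and once as Windows-1252. The helper name `to_utf8` is hypothetical (it is not from the question), and the sketch assumes glibc's iconv on Linux, where the second argument of `iconv` is `char **` and the charset name "WINDOWS-1252" is available.

```cpp
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from the question): converts `in` from the
// charset named by `from` to UTF-8 with iconv, throwing on any failure.
std::string to_utf8(const std::string& in, const char* from)
{
    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    // Generous upper bound: at most 4 UTF-8 bytes per input byte.
    std::string out(in.size() * 4 + 4, '\0');

    // Assumption: glibc signature, which takes a non-const char** for input.
    char* src = const_cast<char*>(in.data());
    size_t srcLeft = in.size();
    char* dst = &out[0];
    size_t dstLeft = out.size();

    // iconv advances src/dst and decrements the two "left" counters.
    size_t rc = iconv(cd, &src, &srcLeft, &dst, &dstLeft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        throw std::runtime_error("iconv failed");

    // Keep only the bytes that were actually written.
    out.resize(out.size() - dstLeft);
    return out;
}
```

With this helper, `to_utf8("\x80", "ISO-8859-1")` should yield the two bytes 194, 128 seen in the question (the UTF-8 encoding of U+0080), while `to_utf8("\x80", "WINDOWS-1252")` should yield 226, 130, 172, the UTF-8 encoding of €.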