3
votes

I have a web application in which I can't get Japanese/Chinese characters to display properly. The odd thing is that I can display these characters properly when I hard-code them into an HTML document.

Characters such as:

アイヌの工芸 : ペンシルバニア大学考古学人類学博物館ヒラーコレクション

But when I pull them out of a proprietary database, they come out as junk:

ã¢ã¤ãã®å·¥è¸ : ãã³ã·ã«ããã¢å¤§å­¦èå¤å­¦äººé¡å­¦åç©é¤¨ãã©ã¼ã³ã¬ã¯ã·ã§ã³

Now, I have the HTML document encoded in UTF-8:

<meta http-equiv="content-type" content="text/html; charset=utf-8"/>

The HTML file itself is saved as UTF-8, not ISO-8859-1 or Western Latin etc.

The weird thing is that when I use iconv to convert the junk string from UTF-8 to ISO-8859-1, it displays correctly:

iconv("UTF-8", "ISO-8859-1//TRANSLIT", $junk_string)

It seems like the junk string is UTF-8 and when I convert the string to ISO-8859-1 it then displays the characters correctly. This doesn't make sense to me at all.

So I sort of have an answer to my problem, but I don't know why it works. I thought that encoding everything in UTF-8 was supposed to fix this kind of thing. I am using Verdana but have tried a couple of other fonts with no success. And the weird thing is that I can hard-code the characters into the HTML page and they display fine, but when I get the same data from the database it displays as junk unless I convert it to ISO-8859-1.

Does anyone have any insight here? And instead of doing this to every piece of data retrieved from the database, is there a way I can change this at the individual page level? I also tried changing the encoding to

<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1"/>

And the characters from the database still do not display correctly.


3 Answers

3
votes

Just a guess, but when a database is utf8 and the html document is utf8, the problem most likely is the database connection, at least in my experience with MySQL.

For example, with the old mysql extension you need to set the character set after opening the connection:

mysql_set_charset('utf8');

For PDO / MySQL it would be something like:

$db->exec('SET CHARACTER SET utf8');

2
votes

The answer is probably that the data in the database is wrong. What likely happened is that an ISO-8859-1 -> UTF-8 conversion was applied to data that was already in UTF-8. Doing a UTF-8 -> ISO-8859-1 conversion therefore gives you the original UTF-8 data back.

Make sure you're not calling utf8_encode (which does an ISO-8859-1 -> UTF-8 conversion) on UTF-8 data!

Since every UTF-8 string is also a valid ISO-8859-1 string (well, not quite, but it's commonly extended so that that's the case), you have no errors on the ISO-8859-1 -> UTF-8 conversion over UTF-8 data.
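To make the round trip concrete, here is a minimal sketch of the double-encoding scenario described above. It assumes the script file itself is saved as UTF-8 and that the iconv extension is available; the variable names are made up for illustration, and whether this is what actually happened to the asker's data is only a guess.

```php
<?php
// Correctly encoded UTF-8 text (assumes this source file is saved as UTF-8).
$original = 'アイヌの工芸';

// The suspected mistake: treat the UTF-8 bytes as ISO-8859-1 and convert
// them to UTF-8 a second time. Every byte >= 0x80 becomes a two-byte sequence.
$doubleEncoded = iconv('ISO-8859-1', 'UTF-8', $original);

// The "fix" from the question: UTF-8 -> ISO-8859-1 strips the extra layer.
$repaired = iconv('UTF-8', 'ISO-8859-1', $doubleEncoded);

printf("original: %d bytes, double-encoded: %d bytes\n",
    strlen($original), strlen($doubleEncoded));
var_dump($repaired === $original);
```

If the stored data really is double-encoded, the cleaner long-term fix is to find and remove the extra conversion on the way in, rather than converting on every read.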

0
votes

This might be because PHP does not deal with UTF-8 natively:

A string is series of characters, where a character is the same as a byte. This means that PHP only supports a 256-character set, and hence does not offer native Unicode support.

Source: http://php.net/manual/en/language.types.string.php

So when receiving the UTF-8 encoded data from your database, you either want to:

  • Transcode your data to a single-byte encoded string for native support (with utf8_decode or iconv), BUT you may lose characters (in your case, a lot...)

  • Or manipulate your data with the multibyte string (mbstring) functions that PHP provides
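As a rough illustration of the second option, the sketch below compares the byte-oriented string functions with their mbstring counterparts. It assumes the mbstring extension is enabled and that the script file is saved as UTF-8. Note that PHP passes UTF-8 bytes through unchanged when you simply echo a string, so the difference mainly shows up when measuring or slicing.

```php
<?php
// UTF-8 text (assumes this source file is saved as UTF-8).
$title = 'アイヌの工芸';

// strlen() counts bytes; mb_strlen() counts characters in the given encoding.
$byteCount = strlen($title);
$charCount = mb_strlen($title, 'UTF-8');

// substr() slices by byte offset; mb_substr() slices by character offset.
$firstThreeBytes = substr($title, 0, 3);             // bytes of one character
$firstThreeChars = mb_substr($title, 0, 3, 'UTF-8'); // three whole characters

printf("%d bytes, %d characters\n", $byteCount, $charCount);
printf("substr: %s, mb_substr: %s\n", $firstThreeBytes, $firstThreeChars);
```

Slicing with substr() at an offset that is not a character boundary would cut a character in half, which is one common source of junk output.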