
I have a page that loads data from different databases (which may use different character sets). The problem is that the data comes out with broken characters when the page is rendered as UTF-8, and I need to find a way to load it properly.

My connection is:

$db = new PDO("mysql:host=".DBHOST.";dbname=".DBNAME, DBUSER, DBPASS);
$db->setAttribute(PDO::MYSQL_ATTR_INIT_COMMAND, 'SET NAMES utf8'); 
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION); 

As you can see, I use 'SET NAMES utf8'.

I have <meta charset="utf-8"> in <head>

And I have tried some conversions:

 error_log("ORIGINAL: ".$row["title"]);
 error_log("ICONV: ".iconv(mb_detect_encoding($row["title"], mb_detect_order(), true), "UTF-8", $row["title"]));
 error_log("UTF_ENCODE: ".utf8_encode ($row["title"]));

I believe all my files are saved as UTF-8 too (I re-saved every file in Notepad, switching from ANSI to UTF-8, then verified the result with this tool: https://nlp.fi.muni.cz/projects/chared/).

Now, here is where the fun begins: not only do I get the wrong output, I also get different output in the browser and in the error log.

Original string stored in DB: http://screenshot.cz/F7/F7XRF/sdb.png

Firefox output:

Original:

http://screenshot.cz/TG/TG7RX/for.png

utf8_encode:

http://screenshot.cz/H9/H9IZJ/fu.png

iconv: same as utf8_encode

And this is how it was written to the PHP error log: http://screenshot.cz/FY/FYXEE/el.png

As you can see, the original string gives the best result, while the converted strings come out even more deformed. I also tried changing the error log file's charset to UTF-8 (the original is unknown, probably ANSI), but the output looks the same in both encodings.

The text is Central European (Czech). The characters I need are: á é ý í ó ú ů ž š č ř ď ť ň ě

So, any ideas about what could be wrong?

Thanks :)

I have previously written an answer containing a short checklist that covers most charset issues in a PHP/MySQL application. There's also a more in-depth topic, UTF-8 All the Way Through. Most likely, you'll find a solution in one or both of these topics. - Qirel
Have you used any other character sets than utf8? - Rick James

1 Answer


Do not use any conversion functions.

There are two causes for black diamonds; see "Trouble with utf8 characters; what I see is not what I stored".

The error file is exhibiting Mojibake, or possibly "double encoding". Those are also discussed in the link above.

Check that Firefox is interpreting the page as UTF-8. Older versions did not necessarily assume this.

Oh, I just noticed the plain question mark. (Also covered in the link.) You win the prize for the greatest number of ways to mangle UTF-8 in a single file!

This possibly means that there are multiple errors occurring. Good luck. If you provide HEX of the data at various stages (in PHP, in the database table, etc), I may be able to help in more detail.
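One way to produce that hex on the PHP side is bin2hex(). The sketch below is only an illustration: hexDump() is a hypothetical helper name, and the sample strings are written as explicit byte escapes so that the encoding of the PHP file itself cannot interfere.

```php
<?php
// Hypothetical helper: show the raw bytes of a string as space-separated hex pairs.
function hexDump(string $bytes): string
{
    return implode(' ', str_split(bin2hex($bytes), 2));
}

// "České" spelled out as UTF-8 bytes: Č is C4 8C and é is C3 A9.
$utf8 = "\xC4\x8Cesk\xC3\xA9";
echo hexDump($utf8), "\n";   // c4 8c 65 73 6b c3 a9

// The same word in ISO-8859-2 (latin2): one byte per character.
$latin2 = "\xC8esk\xE9";
echo hexDump($latin2), "\n"; // c8 65 73 6b e9
```

In the real page you would call something like hexDump($row["title"]) right after fetching, and compare it with what MySQL reports for the same row via its HEX() function, for example SELECT title, HEX(title) FROM your_table (your_table being a placeholder). If the two hex strings differ, the damage happens on the connection; if they match and are already wrong, the data was stored wrong.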

An issue with the Czech character set is that some characters (those with acute accents) are also found in western European subsets, hence are more likely to be rendered correctly. The other accents (the carons) are mostly specific to Czech and go down a different path. This explains why some of your samples exhibit two different failure cases. (Search for Czech on this forum; you may find more tips.)

After some experimentation...

?eské probably comes from the CHARACTER SET of the column in the table being latin1 (or other "latin"), plus establishing the connection as being latin1 when inserting the data. That can be seen on the browser when it is in Western mode, not utf8.

?esk� shows up if you do the above and also have latin1 as the connection during selecting. That is visible with the browser set to utf8.
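The difference between those two renderings can be reproduced without a database. This is a minimal sketch, assuming the value reaching the page is the byte string "?esk" followed by the single latin1 byte 0xE9; it only shows how the same bytes read under the two interpretations and does not prove where the "?" came from. It uses the mbstring extension.

```php
<?php
// Assumed bytes of the damaged value: a literal "?" plus é as the single latin1 byte 0xE9.
$stored = "?esk\xE9";

// Western (latin1) interpretation: every byte is one character, so 0xE9 reads as é.
// Converting to UTF-8 under that assumption yields the bytes C3 A9 for é.
$asLatin1 = mb_convert_encoding($stored, 'UTF-8', 'ISO-8859-1');
echo bin2hex($asLatin1), "\n"; // 3f65736bc3a9

// UTF-8 interpretation: 0xE9 opens a multi-byte sequence that never completes,
// so the string is not valid UTF-8 and a browser shows a replacement character.
var_dump(mb_check_encoding($stored, 'UTF-8')); // bool(false)
```

The first result corresponds to what a browser in Western mode displays (?eské), the second to the black diamond seen in UTF-8 mode.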

Caveat: The analysis may not be the only way to get what you are seeing.
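If the connection charset turns out to be part of the problem, note that, as far as I know, PDO::MYSQL_ATTR_INIT_COMMAND only takes effect when passed in the options array of the PDO constructor; setting it with setAttribute() after connecting does not run the command. An alternative is to put the charset in the DSN (supported since PHP 5.3.6). The following is a sketch under assumptions: buildMysqlDsn() and connectWithCharset() are hypothetical helper names, and utf8mb4 is my suggestion where the question uses utf8.

```php
<?php
// Hypothetical helper: build a MySQL DSN that declares the connection charset.
function buildMysqlDsn(string $host, string $dbname, string $charset = 'utf8mb4'): string
{
    return "mysql:host={$host};dbname={$dbname};charset={$charset}";
}

// Hypothetical helper: open the connection with the charset fixed at connect time.
// Defined here but not called, since it needs a reachable MySQL server.
function connectWithCharset(string $host, string $dbname, string $user, string $pass): PDO
{
    return new PDO(buildMysqlDsn($host, $dbname), $user, $pass, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);
}

echo buildMysqlDsn('localhost', 'example_db'), "\n";
// mysql:host=localhost;dbname=example_db;charset=utf8mb4
```

With the question's constants this would be called as connectWithCharset(DBHOST, DBNAME, DBUSER, DBPASS). Declaring the charset only tells MySQL what the client speaks; it does not repair rows that were already stored incorrectly.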