4
votes

I have a PHP script that imports and parses XML files and saves the data into the database:

  • Database collation: utf8_general_ci, charset: utf8
  • Page's charset : utf-8
  • XML files: ANSI, contains smart quotes (from MS Word)

So during import I do a utf8_encode() on the text from the XML files prior to saving into the database and subsequently displaying on the page.

But when successfully imported, and saved into DB,

  • Database: smart quotes are saved as ? character (viewed from CMD)
  • Page: smart quotes are displayed as boxes

Any ideas as to why the smart quotes are not being converted correctly, even when using utf8_encode()?

EDIT:

@Tomalak: The XML files are actually .txt, no XML declaration (<?xml ... ?>), and no root element. My script actually adds a root element just so the parser works:

utf8_encode('<article>' . file_get_contents($xmlfile) . '</article>');

Seems like I need to add an XML declaration..? If so, how should it look like?

3
Can you post the XML declaration (<?xml ... ?>) of your XML files, along with the character code (use a hex editor) the smart quotes have there? - Tomalak
@Tomalak, I edited my question to answer your comment - Obay
There you have your error. Add an XML declaration <?xml encoding="Windows-1252"?><article> . file_get_contents($xmlfile) . </article> and remove the utf8_encode() part. Then parse the resulting string with DOMDocument. Just make sure that the encoding declaration matches the bytes in the text file. (At least I suppose it should work this way.) - Tomalak
@Tomalak: if you could rewrite that comment as an answer, I would tick it as the accepted answer :) One question though, how did you know "windows-1252" should be used? Because I tried "iso-8859-1" and it didn't work (I opened the XML (txt) files in Firefox which reported the file as iso-8859-1) - Obay
@Tomalak: also, how come there is no need to convert it to utf8 using utf8_encode, since the database is in utf8? - Obay

3 Answers

8
votes

If your XML string (i.e. file contents) is not encoded as UTF-8, you need an XML declaration that denotes the file encoding. If an XML declaration is missing, the parser will assume UTF-8.

As long as you do not use "special" characters (i.e. anything outside of the ASCII range), it will work without a declaration even if your file is not really UTF-8-encoded. This is because UTF-8 is byte-compatible to ASCII. But as soon as characters are used that are on one of the code pages — like the "smart quotes" — it will break because these are represented by different bytes in UTF-8.

In your case there are text files in a legacy encoding that you wrap with a root element to turn them into well-formed XML. Therefore you need to add the XML declaration yourself:

'<?xml encoding="Windows-1252"?><article>'.file_get_contents($xmlfile).'</article>'

This way you instruct the DOMDocument how to interpret the bytes in your string. I assumed Windows-1252 for you because you said ANSI and mentioned the curly quotes.

In fact, 95% of the time this is what people really mean, even on Linux and even if they say ISO-8859-1 (or latin-1), which is almost, but not exactly the same thing.

To be extra sure you can open your text files in a hex editor, spot a few special characters and compare their byte values with the suspected encoding. For Windows-1252. For the curly quotes the expected byte values would be:

  • 147 (0x93)
  • 148 (0x94)

Once the meaning of the individual bytes in your string is declared, DOMDocument can make sense of them and does the right thing.

When it comes to in the DB, I strongly suspect there is some automagic encoding conversion going on. I admit that I don't know enough about PHP/mySQL/Unicode integration to say for sure.

1
votes

They are being converted correctly, because utf8_encode does exactly what it's supposed to be doing and removes them from the input stream.

utf8_encode converts text in ISO-8859-1 into UTF-8 encoding. ISO-8859-1 does not contain the smartquote characters. Your webpage is probably being treated as MSWIN1252 encoding in the browser, which does contain smartquote characters, and you're forcing PHP's input into ISO-8859-1 by calling utf8_encode on it. This means that the only characters that will make it into your database are these ones: http://en.wikipedia.org/wiki/ISO/IEC_8859-1#Codepage_layout

To solve this, you have to make sure that the webpage itself is in UTF-8 encoding, and remove all the utf8_encode/utf8_decode statements from your code. This page explains how to properly work with UTF-8: http://malevolent.com/weblog/archive/2007/03/12/unicode-utf8-php-mysql/

0
votes

Yeah, the problem is that there are two types of smart quotes.

windows-1252, and latin-1

utf8_encode assumes it's latin-1, so it gets the smart quotes wrong.

if we know the input data is windows-1252, we can use iconv to convert it.

$string=iconv ( "Windows-1252" , "UTF-8" , $oldstring );

This is the real answer.