
I'm generating an XML file from data held in a MySQL database, using PHP's DOMDocument to build the XML structure, but I'm struggling with apostrophes in some of the text. The file I'm trying to replicate from a legacy system encodes the apostrophe as &apos;. When I echo $dom->saveXML() to the screen the results look OK (the apostrophe appears as &apos;), but when I use $dom->save() to write the text to a file, the apostrophe appears as &amp;apos;, i.e. it appears to be double escaping the text and encoding the ampersand.

I've been scouring many threads on this over the last few days to see if there is anything I've missed and my last round of testing has been based on the previous article here: PHP How to use quot; entities in XML with DOMdocument which was started nearly 4.5 years ago.

I've also tried different methods including using htmlspecialchars and htmlentities using various combinations of the flags and setting double encode to false.

Using htmlspecialchars, I'm following the advice in the PHP manual that single quotes are translated to &apos; only when ENT_QUOTES is set together with ENT_XML1, ENT_XHTML or ENT_HTML5. I've tried all three of those.

Moving onto code examples to help illustrate the problem...

This is mostly taken from Jack's accepted answer to the question in the thread linked above, with the addition of the htmlspecialchars function wrapped around the content for the text node.

$dom1 = new DOMDocument;

$e = $dom1->createElement('description');
$content = 'single quote: \', double quote: ", opening tag: <, ampersand: &, closing tag: this has changed 02 >';
$t = $dom1->createTextNode(htmlspecialchars($content, ENT_XML1 | ENT_QUOTES,'utf-8',false));
$e->appendChild($t);
$dom1->appendChild($e);

echo '#results: '.$dom1->savexml();

$test1 = $dom1->savexml();
$dom1->save("./exports/"."testing_dom.xml");

Echoing the results to screen gives the output I'm looking for, i.e. in addition to the ampersand, less-than and greater-than characters being encoded as &amp;, &lt; and &gt; respectively, the double quote and single quote are encoded as &quot; and &apos;, which is what I want.

#results: single quote: &apos;, double quote: &quot;, opening tag: &lt;, ampersand: &amp;, closing tag: this has changed 02 &gt;

The last line of the code above saves the results to a testing_dom.xml file, the contents of which appear as follows:

<?xml version="1.0"?>
<description>single quote: &amp;apos;, double quote: &amp;quot;, opening tag: &amp;lt;, ampersand: &amp;amp;, closing tag: this has changed 02 &amp;gt;</description>

Here the leading ampersand of every entity has been escaped a second time, i.e. &apos; becomes &amp;apos;

Is there something I'm missing here with saving the file?

can you not simply use a CDATA section? - Professor Abronsius
“Echoing the results to screen gives the output I'm looking for” - because your browser does its job, and parses the HTML entities you created into the characters they represent. That does in no way mean that the ' in your data actually is that character. You need to apply htmlspecialchars the moment you are making this debug output, for the browser to display what your data actually contains - and then you will see that it is &amp;apos;, same as you see when you check what actually got written to the file. - misorude
Passing any data you treated with htmlspecialchars to createTextNode makes rather little sense to begin with - that method is named that way for a reason, its purpose is to create a text node, with what you passed as the argument becoming that text node’s actual text content. If you pass the characters &, a, p, o, s and ; in sequence, then those characters will be the text you get as result. - misorude
Possible duplicate of PHP XML Entity Encoding issue - misorude

1 Answer


DOMDocument escapes special characters as needed. In a text node inside an element node you do not need to escape the quotes. Inside a double-quoted attribute, " will be escaped as &quot;.

& is a special character itself - it introduces entities - so it will always be escaped as &amp;. If you use htmlspecialchars() on the $content, you trigger double escaping: once by yourself, and once more by the XML serializer.
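To make the double escaping visible, here is a minimal sketch that serializes the same string twice, once pre-escaped with htmlspecialchars() and once raw. The sample string and the variable names are made up for illustration. Note that the results are echoed as plain text; when viewing them in a browser you would wrap them in htmlspecialchars() for display only, otherwise the browser decodes the entities again and hides the problem.

```php
<?php
// Illustrative sample text (not from the question).
$content = "it's < & >";

// 1) Pre-escaped by hand: the text node literally holds "&apos;", "&lt;", ...
$escaped = new DOMDocument();
$node = $escaped->appendChild($escaped->createElement('description'));
$node->appendChild(
    $escaped->createTextNode(
        htmlspecialchars($content, ENT_XML1 | ENT_QUOTES, 'UTF-8', false)
    )
);
// The serializer escapes every "&" again.
$doubleEscaped = $escaped->saveXML($node);

// 2) Raw text: the serializer does the only escaping pass.
$raw = new DOMDocument();
$node = $raw->appendChild($raw->createElement('description'));
$node->appendChild($raw->createTextNode($content));
$escapedOnce = $raw->saveXML($node);

echo $doubleEscaped, "\n";
// <description>it&amp;apos;s &amp;lt; &amp;amp; &amp;gt;</description>
echo $escapedOnce, "\n";
// <description>it's &lt; &amp; &gt;</description>
```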

Your goal should be to get the same value back when reading the generated XML.

$content = 'single quote: \', double quote: ", opening tag: <, ampersand: &, closing tag: this has changed 02 >';

// add content as text and attribute
$document = new DOMDocument();
$element = $document->appendChild($document->createElement('foo'));
$element->textContent = $content;
$element->setAttribute('attr', $content);

echo $xmlString = $document->saveXML();

// load the serialized XML and compare the values with $content
$document = new DOMDocument();
$document->loadXML($xmlString);

var_dump($document->documentElement->textContent === $content);
var_dump($document->documentElement->getAttribute('attr') === $content);

Output:

<?xml version="1.0"?>
<foo attr="single quote: ', double quote: &quot;, opening tag: &lt;, ampersand: &amp;, closing tag: this has changed 02 &gt;">single quote: ', double quote: ", opening tag: &lt;, ampersand: &amp;, closing tag: this has changed 02 &gt;</foo>
bool(true)
bool(true)
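As the output shows, the apostrophe in the text node is left as a literal ', which is valid XML and reads back identically. If a consumer of the file really insists on seeing &apos; in the serialized text, one possible approach (a sketch only, not something the legacy system is known to require) is to split the text at each apostrophe and insert an entity reference node with DOMDocument::createEntityReference(). The sample string and variable names below are assumptions for illustration.

```php
<?php
// Illustrative sample text (not from the question).
$text = "it's rock 'n' roll & more";

$document = new DOMDocument();
$element = $document->appendChild($document->createElement('description'));

// Text nodes for the pieces, an entity reference node for each apostrophe.
foreach (explode("'", $text) as $index => $part) {
    if ($index > 0) {
        $element->appendChild($document->createEntityReference('apos'));
    }
    if ($part !== '') {
        // Raw text: the serializer still escapes & and < on its own.
        $element->appendChild($document->createTextNode($part));
    }
}

$xml = $document->saveXML($element);
echo $xml, "\n";
```

The rest of the text is still passed in unescaped, so there is only one escaping pass. Whether this is worth the extra nodes depends on whether the receiving side is a real XML parser; a conforming parser treats ' and &apos; in text content as the same character.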

DOMNode::$nodeValue and the second argument of DOMDocument::createElement() are broken - they only do partial escaping and expect valid entities. Here are two ways to add a text node that will be properly escaped.

DOMElement::$textContent allows you to read/write the text content of a node. On write it will replace all existing child nodes with a single text node.

DOMDocument::createTextNode() creates a text node with the provided content that can be added to a parent node. This allows for mixed children.
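A short sketch of the difference between the two, using made-up element names and text: createTextNode() lets text sit next to child elements, while writing $textContent on the parent afterwards throws those children away.

```php
<?php
$document = new DOMDocument();
$paragraph = $document->appendChild($document->createElement('p'));

// createTextNode(): text nodes and element nodes can be siblings.
$paragraph->appendChild($document->createTextNode("Tom & Jerry's "));
$bold = $paragraph->appendChild($document->createElement('b'));
$bold->textContent = 'a < b'; // textContent on an empty element
$paragraph->appendChild($document->createTextNode(' > done'));

$mixed = $document->saveXML($paragraph);
echo $mixed, "\n";
// <p>Tom &amp; Jerry's <b>a &lt; b</b> &gt; done</p>

// Writing textContent on the parent replaces ALL children, including <b>.
$paragraph->textContent = 'replaced & flattened';
$flattened = $document->saveXML($paragraph);
echo $flattened, "\n";
// <p>replaced &amp; flattened</p>
```

In both cases the raw string is passed in and the serializer takes care of the escaping.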