I am working on modifying the contents of an XML file generated by some other library. I'm making some DOM modifications with PHP (5.3.10) and reinserting a replacement node.

The XML data I'm working with contains &quot; entities before I do the manipulation, and I want to keep those entities, as per http://www.w3.org/TR/REC-xml/, when I'm done with the modifications.

However, I'm having problems with PHP converting the &quot; entities. See my example.

$temp = 'Hello &quot;XML&quot;.';
$doc = new DOMDocument('1.0', 'utf-8');
$newelement = $doc->createElement('description', $temp);
$doc->appendChild($newelement);
echo $doc->saveXML() . PHP_EOL; // shows " instead of the &quot; entity
$node = $doc->getElementsByTagName('description')->item(0);
echo $node->nodeValue . PHP_EOL; // also shows "

Output

<?xml version="1.0" encoding="utf-8"?> 
<description>Hello "XML".</description>

Hello "XML".

Is this a PHP error or am I doing something wrong? I hope it isn't necessary to use createEntityReference in every char location.

Similar Question: PHP XML Entity Encoding issue


EDIT: Here is an example showing why I think saveXML() should not convert the &quot; entities: &amp; is preserved, so I would expect &quot; to be preserved too. This $temp string should really be output by saveXML() exactly as it was entered, entities included.

$temp = 'Hello &quot;XML&quot; &amp;.';
$doc = new DOMDocument('1.0', 'utf-8');
$newelement = $doc->createElement('description', $temp);
$doc->appendChild($newelement);
echo $doc->saveXML() . PHP_EOL; // shows " instead of keeping the entity, as it does for &amp;
$node = $doc->getElementsByTagName('description')->item(0);
echo $node->nodeValue . PHP_EOL; // also shows " &

Output

<?xml version="1.0" encoding="utf-8"?>
<description>Hello "XML" &amp;.</description>

Hello "XML" &.
Maybe this is of some use? Interesting: I created a new DOMText($temp) as a text node, then appended that to $newelement (an empty <description> node), and the result I got was almost right: <description>Hello &amp;quot;XML&amp;quot;.</description> - Michael Berkowski
@MichaelBerkowski That is interesting. If you used my string $temp, which was already encoded, then your method double-encoded it, but it did keep the encoding properly during saveXML. Can you describe more about what you're doing? I get an 'Invalid Character Error' when I try the DOMText. - user6972
I don't see what's wrong with having unencoded double quotes in an element's node value. They only need escaping inside attribute values. - Ja͢ck
@Ja͢ck The XML spec calls for double quotes to be encoded inside any text node. - user6972
Well, the spec only requires & and < to be escaped in content; escaping single and double quotes is only relevant in attribute values. - Ja͢ck

1 Answer

The answer is that a double quote in element content doesn't actually need any escaping according to the spec (skipping the mentions of CDATA):

The ampersand character (&) and the left angle bracket (<) must not appear in their literal form (...) If they are needed elsewhere, they must be escaped using either numeric character references or the strings " &amp; " and " &lt; " respectively. The right angle bracket (>) may be represented using the string " &gt; " (...)

To allow attribute values to contain both single and double quotes, the apostrophe or single-quote character (') may be represented as " &apos; ", and the double-quote character (") as " &quot; ".
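A practical consequence of the passage above is that a parser sees no difference between &quot; and a literal " inside element content. Here is a minimal sketch to illustrate that (the variable names are only for this example): both spellings parse to the same text, and the serializer then writes the literal form.

```php
<?php
// Parse the same element twice: once with &quot; entities, once with literal quotes.
$escaped = new DOMDocument();
$escaped->loadXML('<description>Hello &quot;XML&quot;.</description>');

$literal = new DOMDocument();
$literal->loadXML('<description>Hello "XML".</description>');

// After parsing, both documents hold exactly the same character data.
$same = $escaped->documentElement->textContent === $literal->documentElement->textContent;
var_dump($same); // bool(true)

// Serializing the entity-encoded input writes literal quotes, since no escaping is required here.
$out = $escaped->saveXML($escaped->documentElement);
echo $out, PHP_EOL; // <description>Hello "XML".</description>
```

So any conforming XML consumer should read the modified file exactly as it read the original one.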

You can verify this easily by using createTextNode() to perform the correct escaping:

$dom = new DOMDocument;
$e = $dom->createElement('description');
$content = 'single quote: \', double quote: ", opening tag: <, ampersand: &, closing tag: >';
$t = $dom->createTextNode($content);
$e->appendChild($t);
$dom->appendChild($e);

echo $dom->saveXML();

Output:

<?xml version="1.0"?>
<description>single quote: ', double quote: ", opening tag: &lt;, ampersand: &amp;, closing tag: &gt;</description>
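If the input string is already entity-encoded, as in the question, passing it straight to createTextNode() would double-encode it, so one option is to decode it first. And if some downstream consumer really insists on seeing &quot; in element content, entity reference nodes can be inserted for each quote. The following is only a sketch under those assumptions; appendDecodedText() and appendTextWithQuotEntities() are hypothetical helper names, not part of the DOM extension.

```php
<?php
// Hypothetical helper: append already entity-encoded text without double encoding it.
function appendDecodedText(DOMDocument $doc, DOMNode $parent, $encoded)
{
    // Turn &quot;, &amp;, &lt; and &gt; back into literal characters first.
    $plain = htmlspecialchars_decode($encoded, ENT_QUOTES);
    // createTextNode() then escapes only what the serializer has to escape.
    $parent->appendChild($doc->createTextNode($plain));
}

// Hypothetical helper: append plain text, emitting each " as an entity reference node.
function appendTextWithQuotEntities(DOMDocument $doc, DOMNode $parent, $plain)
{
    // Split on double quotes but keep them as separate parts.
    $parts = preg_split('/(")/', $plain, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($parts as $part) {
        if ($part === '"') {
            $parent->appendChild($doc->createEntityReference('quot'));
        } elseif ($part !== '') {
            $parent->appendChild($doc->createTextNode($part));
        }
    }
}

$doc = new DOMDocument('1.0', 'utf-8');
$root = $doc->appendChild($doc->createElement('root'));

$a = $root->appendChild($doc->createElement('description'));
appendDecodedText($doc, $a, 'Hello &quot;XML&quot; &amp;.');

$b = $root->appendChild($doc->createElement('description'));
appendTextWithQuotEntities($doc, $b, 'Hello "XML".');

$outA = $doc->saveXML($a);
$outB = $doc->saveXML($b);
echo $outA, PHP_EOL; // <description>Hello "XML" &amp;.</description>
echo $outB, PHP_EOL; // <description>Hello &quot;XML&quot;.</description>
```

The first helper is usually all that is needed. The second produces a tree with several child nodes instead of one text node, which makes later DOM manipulation more awkward, so it is probably only worth it when the exact bytes of the output matter.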