4
votes

I want to perform certain manipulations on an XML document with PHP, using the DOM part of its standard library. As others have already discovered, one then has to deal with decoded entities. To illustrate what bothers me, here is a quick example.

Suppose we have the following code:

$doc = new DOMDocument();
$doc->loadXML(<XML data>);

$xpath = new DOMXPath($doc);
$node_list = $xpath->query(<some XPath>);

foreach($node_list as $node) {
    //do something
}

If the code in the loop is something like

$attr = "<some string>";
$val = $node->getAttribute($attr);
//do something with $val
$node->setAttribute($attr, $val);

it works fine. But if it's more like

$text = $node->textContent;
//do something with $text
$node->nodeValue = $text;

and $text contains a decoded &, it does not get encoded again, even if one does nothing with $text at all.

At the moment, I apply htmlspecialchars to $text before assigning it to $node->nodeValue. Now I want to know

  1. if that is sufficient,
  2. if not, what would suffice,
  3. and if there are more elegant solutions for this, as in the case of attribute manipulation.

The XML documents I have to deal with are mostly feeds, so a solution should be pretty general.


EDIT

It turned out that my original question had the wrong scope, sorry for that. Here I provide an example where the described behaviour actually happens.

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://feeds.bbci.co.uk/news/rss.xml?edition=uk");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch);
curl_close($ch);

$doc = new DOMDocument();
$doc->loadXML($output);

$xpath = new DOMXPath($doc);
$node_list = $xpath->query('//item/link');

foreach($node_list as $node) {
        $node->nodeValue = $node->textContent;
}
echo $doc->saveXML();

If I execute this code on the CLI with

php beeb.php |egrep 'link|Warning'

I get results like

<link>http://www.bbc.co.uk/news/world-africa-23070006#sa-ns_mchannel=rss</link>

which should be

<link>http://www.bbc.co.uk/news/world-africa-23070006#sa-ns_mchannel=rss&ns_source=PublicRSS20-sa</link>

(and is, if the loop is omitted), along with corresponding warnings such as

Warning: main(): unterminated entity reference ns_source=PublicRSS20-sa in /private/tmp/beeb.php on line 15

When I apply htmlspecialchars to $node->textContent, it works fine, but I feel very uncomfortable doing that.

2
Pretty good question. Just by reading it I found the solution to my similar problem. Thank you. - zVictor

2 Answers

8
votes

Your question is basically whether DOMText::nodeValue should be set to an XML-encoded string or to a verbatim string.

So let's just try that out: set it to & and then to &amp; and see what happens:

$doc = new DOMDocument();
$doc->loadXML('<root>*</root>');

$text = $doc->documentElement->childNodes->item(0);

echo "Before Edit: ", $doc->saveXML($text), "\n";

$text->nodeValue = "&";

echo "After Edit 1: ", $doc->saveXML($text), "\n";

$text->nodeValue = "&amp;";

echo "After Edit 2: ", $doc->saveXML($text), "\n";

The output is as follows (PHP 5.0.0 - 5.5.0):

Before Edit: *
After Edit 1: &amp;
After Edit 2: &amp;amp;

This shows that the nodeValue of a DOMText node expects a plain UTF-8 encoded string, and that the DOM library encodes the XML reserved characters automatically.

So you should not apply htmlspecialchars() to any text you add this way. That would result in double encoding.
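To make the double encoding concrete, here is a minimal sketch along the lines of the test above. The sample string 'Tom & Jerry' and the variable names $raw, $plain and $double are made up for illustration.

```php
<?php
// Sketch: why htmlspecialchars() must not be combined with DOMText::nodeValue.
// The sample string and variable names are illustrative assumptions.
$doc = new DOMDocument();
$doc->loadXML('<root>*</root>');

// The only child of <root> is a DOMText node.
$text = $doc->documentElement->firstChild;
$raw  = 'Tom & Jerry';

// Verbatim string: the DOM library escapes the ampersand on output.
$text->nodeValue = $raw;
$plain = $doc->saveXML($text);

// Pre-escaped string: the ampersand of "&amp;" is escaped a second time.
$text->nodeValue = htmlspecialchars($raw);
$double = $doc->saveXML($text);

echo $plain, "\n";   // Tom &amp; Jerry
echo $double, "\n";  // Tom &amp;amp; Jerry
```

Run it on the command line rather than in a browser, so that the entities are visible as they are written.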

Since you write that you experience the opposite, I suggest you run an isolated PHP example on the command line or within your IDE, so that you can see the exact output. It may be that your browser renders the output as HTML, which makes it look as if the reserved XML characters have not been encoded.


As you have pointed out, you're not editing a DOMText but a DOMElement node. That works a bit differently: here the & character needs to be passed as the entity &amp; instead of verbatim, but only this character.

So this needs a little bit more work:

  1. Read out the text content and turn it into a DOMText node. Everything will be perfectly encoded.
  2. Clear the node value of the element node so that it's empty.
  3. Append the DOMText node from the first step as a child.

And done. Here is your inner foreach, modified to show this:

foreach($node_list as $node) {
    $text = $doc->createTextNode($node->textContent);
    $node->nodeValue = "";
    $node->appendChild($text);
}
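For a version that runs without network access, the loop can be tried against an inline document. This is only a sketch: the XML string and the example.com URL are hypothetical stand-ins for the real feed.

```php
<?php
// Sketch: rewrite element content through a DOMText node so that "&" is
// encoded correctly. The sample XML below is a hypothetical stand-in.
$xml = '<rss><channel><item>'
     . '<link>http://example.com/?a=1&amp;b=2</link>'
     . '</item></channel></rss>';

$doc = new DOMDocument();
$doc->loadXML($xml);

$xpath = new DOMXPath($doc);
foreach ($xpath->query('//item/link') as $node) {
    // textContent is decoded, i.e. it contains a bare "&".
    $text = $doc->createTextNode($node->textContent);
    // Empty the element, then append the text node; the DOM library
    // takes care of the encoding when the document is serialized.
    $node->nodeValue = '';
    $node->appendChild($text);
}

$result = $doc->saveXML($doc->documentElement);
echo $result, "\n";
```

The link should come out with &amp; intact and without an "unterminated entity reference" warning.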

I must admit, though, that I don't understand why you do this in your concrete example: it does not change the value, so it would not be needed there.

Tip: In PHP, DOMDocument can open this feed directly; you don't need curl here:

$doc = new DOMDocument();
$doc->load("http://feeds.bbci.co.uk/news/rss.xml?edition=uk");

2
votes

As hakre explained, the problem is that in PHP's DOM library, the behaviour of setting nodeValue w.r.t. entities depends on the class of the node; in particular, DOMText and DOMElement differ in this regard. Here is an example to illustrate this:

$doc = new DOMDocument();
$doc->formatOutput = true;
$doc->loadXML('<root/>');

$s = 'text &amp;&lt;<"\'&text;&text';

$root = $doc->documentElement;

$node = $doc->createElement('tag1', $s); #line 10
$root->appendChild($node);

$node = $doc->createElement('tag2');
$text = $doc->createTextNode($s);
$node->appendChild($text);
$root->appendChild($node);

$node = $doc->createElement('tag3');
$text = $doc->createCDATASection($s);
$node->appendChild($text);
$root->appendChild($node);

echo $doc->saveXML();

outputs

Warning: DOMDocument::createElement(): unterminated entity reference            text in /tmp/DOMtest.php on line 10
<?xml version="1.0"?>
<root>
  <tag1>text &amp;&lt;&lt;"'&text;</tag1>
  <tag2>text &amp;amp;&amp;lt;&lt;"'&amp;text;&amp;text</tag2>
  <tag3><![CDATA[text &amp;&lt;<"'&text;&text]]></tag3>
</root>

In this particular case, it is appropriate to alter the nodeValue of DOMText nodes. Combining hakre's two answers, one gets a quite elegant solution:

$doc = new DOMDocument();
$doc->loadXML(<XML data>);

$xpath     = new DOMXPath($doc);
$node_list = $xpath->query(<some XPath>);

$visitTextNode = function (DOMText $node) {
    $text = $node->textContent;
    /*
        do something with $text
    */
    $node->nodeValue = $text;
};

foreach ($node_list as $node) {
    if ($node->nodeType == XML_TEXT_NODE) {
        $visitTextNode($node);
    } else {
        foreach ($node->childNodes as $child) {
            if ($child->nodeType == XML_TEXT_NODE) {
                $visitTextNode($child);
            }
        }
    }
}
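
As a runnable illustration of the loop above, here is a sketch with the placeholders filled in. The sample XML, the //title XPath and the strtoupper() transformation are assumptions chosen only for the demonstration.

```php
<?php
// Sketch: modify text through DOMText nodes only, so that no manual
// escaping is needed. Sample data and the transformation are hypothetical.
$xml = '<feed><title>Fish &amp; Chips</title><title>a &lt; b</title></feed>';

$doc = new DOMDocument();
$doc->loadXML($xml);

$xpath = new DOMXPath($doc);

$visitTextNode = function (DOMText $node) {
    // textContent is decoded; the DOM library re-encodes it on output.
    $node->nodeValue = strtoupper($node->textContent);
};

foreach ($xpath->query('//title') as $node) {
    foreach ($node->childNodes as $child) {
        if ($child->nodeType == XML_TEXT_NODE) {
            $visitTextNode($child);
        }
    }
}

$result = $doc->saveXML($doc->documentElement);
echo $result, "\n";
// <feed><title>FISH &amp; CHIPS</title><title>A &lt; B</title></feed>
```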