I want to perform certain manipulations on an XML document with PHP, using the DOM part of its standard library. As others have already discovered, one then has to deal with decoded entities. To illustrate what bothers me, here is a quick example.
Suppose we have the following code:
$doc = new DOMDocument();
$doc->loadXML(<XML data>);
$xpath = new DOMXPath($doc);
$node_list = $xpath->query(<some XPath>);
foreach ($node_list as $node) {
    // do something
}
If the code in the loop is something like
$attr = "<some string>";
$val = $node->getAttribute($attr);
//do something with $val
$node->setAttribute($attr, $val);
it works fine. But if it's more like
$text = $node->textContent;
//do something with $text
$node->nodeValue = $text;
and $text contains a decoded &, it doesn't get encoded again, even if one does nothing with $text at all.
At the moment, I apply htmlspecialchars to $text before I assign it to $node->nodeValue. Now I want to know
- if that is sufficient,
- if not, what would suffice,
- and if there are more elegant solutions for this, as in the case of attribute manipulation.
The XML documents I have to deal with are mostly feeds, so a solution should be pretty general.
EDIT
It turned out that my original question had the wrong scope, sorry for that. Here I provide an example where the described behaviour actually happens.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://feeds.bbci.co.uk/news/rss.xml?edition=uk");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch);
curl_close($ch);
$doc = new DOMDocument();
$doc->loadXML($output);
$xpath = new DOMXPath($doc);
$node_list = $xpath->query('//item/link');
foreach ($node_list as $node) {
    $node->nodeValue = $node->textContent;
}
echo $doc->saveXML();
If I execute this code on the CLI with
php beeb.php |egrep 'link|Warning'
I get results like
<link>http://www.bbc.co.uk/news/world-africa-23070006#sa-ns_mchannel=rss</link>
which should be
<link>http://www.bbc.co.uk/news/world-africa-23070006#sa-ns_mchannel=rss&ns_source=PublicRSS20-sa</link>
(and is, if the loop is omitted), together with corresponding warnings such as
Warning: main(): unterminated entity reference ns_source=PublicRSS20-sa in /private/tmp/beeb.php on line 15
When I apply htmlspecialchars to $node->textContent, it works fine, but I feel very uncomfortable doing that.
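For reference, the htmlspecialchars workaround can be reproduced without the network roughly as follows. This is only a minimal sketch: the inline feed is a made-up stand-in for the downloaded data (only the link value is taken from the output above), and the ENT_XML1 | ENT_NOQUOTES flags are an assumption, not something the code above used.

```php
<?php
// Hypothetical stand-in for the cURL download; only the <link> value
// comes from the example above.
$xml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"><channel><item>
<link>http://www.bbc.co.uk/news/world-africa-23070006#sa-ns_mchannel=rss&amp;ns_source=PublicRSS20-sa</link>
</item></channel></rss>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

foreach ($xpath->query('//item/link') as $node) {
    // textContent yields the decoded string (a bare "&"), so it is
    // re-encoded before being assigned to nodeValue.
    // Flag choice is an assumption: escape &, < and > only.
    $node->nodeValue = htmlspecialchars(
        $node->textContent,
        ENT_XML1 | ENT_NOQUOTES,
        'UTF-8'
    );
}

$result = $doc->saveXML();
echo $result;
```

With this change the warning no longer shows up for me and the link keeps its query string, but it only seems to work because assigning to nodeValue interprets entity references, which is exactly the part I am unsure about.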