
Consider, for example, an ISO-8859-1 encoded XML document that contains characters outside that encoding's character set, say the € (euro) symbol. This is possible in XML if the symbol is written as a numeric character reference, in this case the string &#8364;:

<?xml version="1.0" encoding="ISO-8859-1"?>
<foo>
    <bar>&#8364;</bar>
</foo>

I need to obtain the bar element as a string in the same encoding as the document, i.e. ISO-8859-1, which also means preserving the character references for characters that are not part of this encoding. The expected result is the ISO-8859-1 string <bar>&#8364;</bar>.

I couldn't achieve this with the saveXML method of the DOMDocument class, since it always dumps elements in UTF-8 (whereas whole documents are always dumped in the encoding given in their XML declaration):

$DD = new DOMDocument();
$DD->load('foo.xml');
$dump = $DD->saveXML($DD->getElementsByTagName('bar')->item(0));

The $dump variable ended up holding the UTF-8 string <bar>€</bar>.

Notice that when elements are dumped, their character references are also converted to actual UTF-8 characters.

So, how would I get the ISO-8859-1 string <bar>&#8364;</bar>? Are XML parsers meant for this sort of task, or should I just use regular expressions or something else?

2 Answers

1
votes

Yes, XML parsers will decode entities. If you save only part of a document, the output will be UTF-8: a fragment has no XML declaration in which to specify the encoding, so saveXML() falls back to UTF-8.

Here is a demo:

$xml = <<<'XML'
<?xml version="1.0" encoding="ISO-8859-1"?>
<foo>
    <bar>&#8364;</bar>
</foo>
XML;

$source = new DOMDocument();
$source->loadXML($xml);

echo "Document Part:\n";
echo $source->saveXML($source->getElementsByTagName('bar')->item(0));
echo "\n\n";

echo "Whole Document:\n";
echo $source->saveXML();
echo "\n\n";

Output:

Document Part:
<bar>€</bar>

Whole Document:
<?xml version="1.0" encoding="ISO-8859-1"?>
<foo>
    <bar>&#8364;</bar>
</foo>

You could copy the node into a new document. However, the output will then include the XML declaration with the encoding:

$target = new DOMDocument('1.0', 'ASCII');
$target->appendChild($target->importNode($source->getElementsByTagName('bar')->item(0), true));

echo "Separated Node:\n";
echo $target->saveXML();

Output:

Separated Node:
<?xml version="1.0" encoding="ASCII"?>
<bar>&#8364;</bar>
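
If the XML declaration is unwanted, one option is to remove it from the serialized string afterwards. The following is only a sketch of that idea, not part of the original answer: it assumes the target document is created with ISO-8859-1 (instead of ASCII) and that a simple regular expression is acceptable for dropping the declaration. The variable names $withDeclaration and $fragment are made up for illustration.

```php
<?php
// Sketch: serialize a single element in ISO-8859-1 by moving it into
// its own document and removing the XML declaration afterwards.
$xml = <<<'XML'
<?xml version="1.0" encoding="ISO-8859-1"?>
<foo>
    <bar>&#8364;</bar>
</foo>
XML;

$source = new DOMDocument();
$source->loadXML($xml);

// The target document carries the desired output encoding (assumption:
// ISO-8859-1, to match the source document).
$target = new DOMDocument('1.0', 'ISO-8859-1');
$target->appendChild(
    $target->importNode($source->getElementsByTagName('bar')->item(0), true)
);

// saveXML() without arguments uses the document encoding, so characters
// outside ISO-8859-1 are written as numeric character references.
$withDeclaration = $target->saveXML();

// Remove the leading XML declaration and any surrounding whitespace.
$fragment = trim(preg_replace('/^<\?xml[^>]*\?>\s*/', '', $withDeclaration));

echo $fragment, "\n";
```

With the input above, $fragment should hold the bar element without the declaration, with the euro sign still written as a character reference.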
1
votes

It looks like the encoding is ignored when saveXML() is called with a node argument. The encoding property of the DOMDocument class is used by saveXML(), but only when saving the whole document. The source code of saveXML() even contains a comment mentioning the encoding property:

if (nodep != NULL) {
    [...]
} else {
    [...]
    /* Encoding is handled from the encoding property set on the document */
    xmlDocDumpFormatMemory(docp, &mem, &size, format);
}

According to the Document Object Model (DOM) Level 3 Load and Save Specification, many of the defined types support setting the encoding (and the PHP implementation has it at least on the DOMDocument class), so I'm not sure whether this is a bug in PHP's DOM implementation. However, the documentation also states that the extension uses UTF-8 encoding:

Note:

The DOM extension uses UTF-8 encoding. Use utf8_encode() and utf8_decode() to work with texts in ISO-8859-1 encoding or iconv for other encodings.

So the solution would be either to convert the result to the desired encoding with such functions, or to save the whole XML document by calling saveXML() without any arguments.
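
A plain utf8_decode() call would not be enough for the euro sign, because characters that ISO-8859-1 cannot represent are replaced with "?" (and utf8_encode()/utf8_decode() are deprecated as of PHP 8.2). A possible workaround, sketched here under the assumption that the mbstring extension is available, is to escape everything above U+00FF as numeric character references first and convert afterwards. The helper name utf8FragmentToLatin1 is hypothetical.

```php
<?php
// Hypothetical helper: convert a UTF-8 XML fragment to ISO-8859-1,
// keeping characters the target encoding cannot represent as
// numeric character references.
function utf8FragmentToLatin1(string $utf8): string
{
    // Map entry: [first code point, last code point, offset, mask].
    // Code points above U+00FF have no ISO-8859-1 representation.
    $map = [0x100, 0x10FFFF, 0, 0x1FFFFF];
    $escaped = mb_encode_numericentity($utf8, $map, 'UTF-8');

    // Everything left is within U+0000..U+00FF and converts directly.
    return mb_convert_encoding($escaped, 'ISO-8859-1', 'UTF-8');
}

// UTF-8 input as produced by a node dump: euro sign plus an e-acute.
$dump = "<bar>\u{20AC} caf\u{E9}</bar>";
$latin1 = utf8FragmentToLatin1($dump);

// Print the bytes in hex so the single-byte e-acute is visible.
echo bin2hex($latin1), "\n";
```

Applied to the $dump string from the question, this should keep the euro sign as &#8364; while characters such as é become single ISO-8859-1 bytes.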