
I'm trying to ascertain what should happen when an XML parser reads in attribute a of element x in the sample below:

<!DOCTYPE x [
  <!ELEMENT x EMPTY>
  <!ATTLIST x a CDATA #IMPLIED>
  <!ENTITY d "&#xD;">
  <!ENTITY a "&#xA;">
  <!ENTITY t "&#x9;">
  <!ENTITY t2 " "><!-- a real tab-->
]>
<x a="CARRIAGE_RETURNS:(&d;&#xD;),NEWLINES:(&a;&#xA;),TABS:(&t;&#x9;&t2; )"/><!-- a real tab at the end -->

The essential part of the Attribute-Value Normalization rules in the spec involves traversing the attribute value and applying this case statement:

  • For a character reference, append the referenced character to the normalized value.
  • For an entity reference, recursively apply step 3 [that's the case statement] of this algorithm to the replacement text of the entity. [EDIT: replacement text, as distinct from literal entity value, seems to be the key concept in understanding what's going on. See below.]
  • For a white space character (#x20, #xD, #xA, #x9), append a space character (#x20) to the normalized value.
  • For another character, append the character to the normalized value.
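
To make the case statement concrete, here is a minimal Java sketch of step 3 as I read it. The class name AttrNormalizer, the normalize and show methods, and the Map of entity names are my own scaffolding, not part of the spec or of any parser API. It assumes well-formed references and ignores everything else the spec requires (end-of-line handling, the extra trimming for non-CDATA attributes, and so on).

```java
import java.util.HashMap;
import java.util.Map;

class AttrNormalizer {

    // Applies the four-way case statement to `text`.
    // `entities` maps an entity name to whatever text the parser substitutes for it
    // (hypothetical structure, for illustration only).
    static String normalize(String text, Map<String, String> entities) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '&') {
                int end = text.indexOf(';', i);
                if (end < 0) {
                    throw new IllegalArgumentException("unterminated reference");
                }
                String ref = text.substring(i + 1, end);
                if (ref.startsWith("#x")) {
                    // Character reference (hex): append the referenced character as-is.
                    out.appendCodePoint(Integer.parseInt(ref.substring(2), 16));
                } else if (ref.startsWith("#")) {
                    // Character reference (decimal): append the referenced character as-is.
                    out.appendCodePoint(Integer.parseInt(ref.substring(1)));
                } else {
                    // Entity reference: recursively apply the same rules to its text.
                    String replacement = entities.get(ref);
                    if (replacement == null) {
                        throw new IllegalArgumentException("undeclared entity: " + ref);
                    }
                    out.append(normalize(replacement, entities));
                }
                i = end + 1;
            } else if (c == ' ' || c == '\r' || c == '\n' || c == '\t') {
                // Literal white space: append a single space.
                out.append(' ');
                i++;
            } else {
                // Any other character: append unchanged.
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    // Makes white space visible, using the same markers as the outputs in this post.
    static String show(String s) {
        return s.replace("\t", "[TAB]").replace(" ", "[SPACE]")
                .replace("\r", "[CR]").replace("\n", "[NL]");
    }

    public static void main(String[] args) {
        Map<String, String> asCharacter = new HashMap<>();
        asCharacter.put("d", "\r");        // entity d holds a real CR character
        Map<String, String> asReference = new HashMap<>();
        asReference.put("d", "&#xD;");     // entity d holds the text of a character reference

        System.out.println(show(normalize("(&d;&#xD;)", asCharacter))); // ([SPACE][CR])
        System.out.println(show(normalize("(&d;&#xD;)", asReference))); // ([CR][CR])
    }
}
```

The two calls in main differ only in what is stored for the entity d: a real carriage-return character, or the text of a character reference. Which of those the parser is supposed to feed into the recursion is exactly what the rest of this question turns on.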

My reading of those rules would lead me to think that the output of the XML parser for the attribute value should be as follows (interpretation: the same rules apply whether in attribute or entity - character references preserved, actual characters replaced):

CARRIAGE_RETURNS:([CR][CR]),NEWLINES:([NL][NL]),TABS:([TAB][TAB][SPACE][SPACE])

However, the example given a little bit below that in the spec suggests that the output should be as follows, and a Java test I wrote works in exactly that way (interpretation: if it's an entity value, it's always a replacement):

CARRIAGE_RETURNS:([SPACE][CR]),NEWLINES:([SPACE][NL]),TABS:([SPACE][TAB][SPACE][SPACE])

On the other hand, a test I wrote in PHP outputs this (interpretation: if it's an entity value, it's never a replacement):

CARRIAGE_RETURNS:([CR][CR]),NEWLINES:([NL][NL]),TABS:([TAB][TAB][TAB][SPACE])

Similar output is given by running the xml file through an identity XSLT transform using the xsltproc tool:

<x a="CARRIAGE_RETURNS:(&#13;&#13;),NEWLINES:(&#10;&#10;),TABS:(&#9;&#9;&#9; )"/>

So my question is: what should happen and why?

Sample PHP and Java programs below:

PHP:

<?php
// Library versions from phpinfo():
// DOM/XML API Version  20031129
// libxml Version  2.6.32
$doc = new DOMDocument();
$doc->load("t.xml");
// Make white space in the attribute value visible.
echo str_replace(array("\t", " ", "\r", "\n"), array("[TAB]", "[SPACE]", "[CR]", "[NL]"), $doc->documentElement->getAttribute("a")), "\n";

Java:

import java.io.*;
class T{

  public static void main(String[] args) throws Exception {
    String xmlString = readFile(args[0]);
    System.out.println(xmlString);
    org.w3c.dom.Document doc =
      javax.xml.parsers.DocumentBuilderFactory.newInstance().
      newDocumentBuilder().
      parse(new org.xml.sax.InputSource(new StringReader(xmlString)));
    System.out.println(doc.getImplementation());
    System.out.println(
      doc.
      getDocumentElement().
      getAttribute("a").
      replace("\t", "[TAB]").
      replace(" ", "[SPACE]").
      replace("\r", "[CR]").
      replace("\n", "[NL]")
    );
  }

  // Very rough, but works in this case
  private static String readFile(String fileName) throws IOException {
    File file = new File(fileName);
    InputStream inputStream = new FileInputStream(file);
    byte[] buffer = new byte[(int)file.length()];
    int length = inputStream.read(buffer);
    String result = new String(buffer, 0, length);
    inputStream.close();
    return result;
  }

}

2 Answers

1
votes

So the question is whether the replacement text of the entity is a carriage-return character, or the character reference that represents a carriage-return character.

If you look at the examples in Appendix D of the XML Recommendation (especially the one described as "a more complex example"), it appears that the replacement text in your example should be the carriage-return character itself, not the character reference. That means your Java test is the correct one, at least if my interpretation of the appendix is right.

Note, however, that Appendix D is non-normative, so you would have to read the body of the Recommendation to find the actual rules. I believe that's section 4.4, but that table just made my head hurt.

1
votes

Section 4.5 of the XML spec, "Construction of Entity Replacement Text", draws two important distinctions.

  • For every entity there's a distinction between its literal entity value and the replacement text that's extracted from its literal value.
  • There are different rules for this mapping depending on whether it's an internal or an external entity.

An external entity, for our current purposes, can be thought of as being like an include file in C or PHP - it's a file or another external resource whose content is inserted and then processed. An internal entity is carried in the payload of the DTD, and to ensure that arbitrary internal entities can be carried without being mixed up with the DTD syntax, it's carried in an escaped form known as the literal entity value. In order to convert the literal entity value to its replacement text, the following rule is applied:

For an internal entity, the replacement text is the content of the entity, after replacement of character references and parameter-entity references.

So:

  • A literal entity value of "[TAB]" maps to the replacement text [TAB]. Here [TAB] is an ad-hoc escape of my own that means the tab character, since I can't type a tab into this textbox and have it understood. I hope that doesn't confuse things; if anything, it shows that there are good reasons to have escape mechanisms. The important thing is to understand where each one is being used, so that something that looks complicated can be decomposed into different levels of escaping.
  • A literal entity value of "&#x9;" also maps to the replacement text [TAB]. So as far as the attribute-value normalisation logic is concerned, it is a tab, and the logic doesn't know that it was represented in the internal entity using a character reference. It might seem that this is redundant or that some information is lost, but not really: escape mechanisms allow you to escape anything, including things that you don't need to escape. For example, you could probably replace every use of the Latin lower case a in an HTML file by &#x61; and neither gain nor lose information.
  • A literal entity value of "&#38;#x9;" maps to the replacement text &#x9;. The attribute-value normalisation logic will interpret that as a character reference for a tab, and will normalise its value as a tab rather than collapsing it.
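  • A literal entity value of "&#38;#38;#x9;" maps to the replacement text &#38;#x9;. As far as I can tell, normalisation then turns the &#38; into an ampersand and copies the rest through unchanged, so the attribute value ends up as the five literal characters &#x9; rather than a tab.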
  • And so on...
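
The mapping in the list above can be sketched in a few lines of Java. This is only an illustration under simplifying assumptions: the class name ReplacementText and the fromLiteral and show methods are mine, parameter-entity references are ignored entirely, and general-entity references are passed through untouched. The point it tries to show is that character references are expanded in a single pass, so the output of one expansion is never rescanned.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplacementText {

    // Matches a decimal or hexadecimal character reference such as &#38; or &#x9;
    private static final Pattern CHAR_REF =
            Pattern.compile("&#(x[0-9A-Fa-f]+|[0-9]+);");

    // Converts a literal entity value into replacement text by expanding
    // character references exactly once (hypothetical helper, not a parser API).
    static String fromLiteral(String literal) {
        Matcher m = CHAR_REF.matcher(literal);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the text before the reference unchanged.
            out.append(literal, last, m.start());
            String body = m.group(1);
            int codePoint = body.startsWith("x")
                    ? Integer.parseInt(body.substring(1), 16)
                    : Integer.parseInt(body);
            // Append the referenced character; it is NOT scanned again.
            out.appendCodePoint(codePoint);
            last = m.end();
        }
        out.append(literal, last, literal.length());
        return out.toString();
    }

    // Makes a real tab visible using the [TAB] convention from this answer.
    static String show(String s) {
        return s.replace("\t", "[TAB]");
    }

    public static void main(String[] args) {
        System.out.println(show(fromLiteral("\t")));             // [TAB]
        System.out.println(show(fromLiteral("&#x9;")));          // [TAB]
        System.out.println(show(fromLiteral("&#38;#x9;")));      // &#x9;
        System.out.println(show(fromLiteral("&#38;#38;#x9;")));  // &#38;#x9;
    }
}
```

Each additional &#38; in the literal entity value buys one more level of escaping, which is why the list can continue indefinitely.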

It can look like some sort of off-by-one or double-encoding error that, in order for a [TAB] to show up in an attribute value, your internal entity has to contain the literal text &#38;#x9;. That impression comes from the fact that DTDs happen to use the same character escape mechanism as XML does, but for different reasons. If DTDs used a different escape mechanism, for example something along the lines of \u0009 for a tab, then the literal entity value would contain \uyyyy-escaped characters interspersed with &#xyyyy;-escaped characters, and we could always tell which escape mechanism belonged to which level.

Anyway, that's not the way it's done, so we just have to keep a clear idea of what's going on. It's like writing a regex to detect backslashes: you have to escape the backslash in the regex by doubling it, and if you're using a language without regex literals, you have to put the regex in a string with correct escapes, so it ends up as four backslashes in a row. That looks completely wrong, but it's right once you think about the interaction of the different levels of escape mechanism. (By the way, I originally tried to write out those backslashes, but to get around Stack Overflow's own escape mechanism I would have had to write eight backslashes in a row, and it didn't feel safe to write that.)
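
For what it's worth, the backslash analogy looks like this in Java, which has no regex literals. The class and method names are made up for the illustration; Pattern.compile and Matcher.find are the standard java.util.regex calls.

```java
import java.util.regex.Pattern;

class BackslashLevels {

    // Level 1: the Java compiler turns the four backslashes in the source
    //          into a two-character string.
    // Level 2: the regex engine turns those two characters into a pattern
    //          that matches one literal backslash.
    static final Pattern ONE_BACKSLASH = Pattern.compile("\\\\");

    static boolean containsBackslash(String s) {
        return ONE_BACKSLASH.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println("\\\\".length());            // 2
        System.out.println(containsBackslash("a\\b"));  // true
        System.out.println(containsBackslash("a/b"));   // false
    }
}
```

The entity case has the same shape: the DTD layer strips one level of escaping when it builds the replacement text, and the attribute-value layer strips the next.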

The above seems OK to me at the moment as an explanation of the spec and of the Java implementation as demonstrated in the sample code. It's obviously not consistent with the PHP sample, but I don't mean to imply that there's a bug: the PHP DOM implementation sits on top of a mature C library with a lot of configuration options, one or more of which might be tweakable to get behaviour that's consistent with the Java sample. Examples like this bring home to me how complicated XML is. Simplified explanations like the one I give above may be useful for a broad-brush idea of what goes on 95% of the time, but the other 5% can be very hard to understand and explain. So if there's a flaw in my explanation, or you have a better one, please add a comment or another answer, the more pedantic the better.