XML attribute-value normalisation - how should whitespace in entities be treated?

Question

I'm trying to ascertain what should happen when an XML parser reads in attribute a of element x in the sample below:

<!DOCTYPE x [
  <!ELEMENT x EMPTY>
  <!ATTLIST x a CDATA #IMPLIED>
  <!ENTITY d "&#xD;">
  <!ENTITY a "&#xA;">
  <!ENTITY t "&#x9;">
  <!ENTITY t2 " "><!-- a real tab-->
]>
<x a="CARRIAGE_RETURNS:(&d;&#xD;),NEWLINES:(&a;&#xA;),TABS:(&t;&#x9;&t2; )"/><!-- a real tab at the end -->

The essential part of the Attribute-Value Normalization rules in the spec involves traversing the attribute value and applying this case statement:

For a character reference, append the referenced character to the normalized value.
For an entity reference, recursively apply step 3 [that's the case statement] of this algorithm to the replacement text of the entity. [EDIT: replacement text, as distinct from literal entity value, seems to be the key concept in understanding what's going on. See below.]
For a white space character (#x20, #xD, #xA, #x9), append a space character (#x20) to the normalized value.
For another character, append the character to the normalized value.
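
To make the case statement concrete, here is a minimal Java sketch of step 3. It is an illustration, not a parser: the class name `AttrNormalizer`, the `normalize` and `show` helpers, and the hand-built entity table are all hypothetical. It assumes the table already holds each entity's replacement text, that is, the character references inside the entity declarations have already been turned into real characters. It does not handle predefined entities, undeclared entities, or malformed references.

```java
import java.util.HashMap;
import java.util.Map;

public class AttrNormalizer {

    // Applies the four-way case statement to `text`, appending to `out`.
    // `entities` maps entity names to their replacement text (an assumption of this sketch).
    static void normalize(String text, Map<String, String> entities, StringBuilder out) {
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '&') {
                int end = text.indexOf(';', i);
                String ref = text.substring(i + 1, end);
                if (ref.startsWith("#x")) {
                    // Character reference (hex): append the referenced character as-is.
                    out.appendCodePoint(Integer.parseInt(ref.substring(2), 16));
                } else if (ref.startsWith("#")) {
                    // Character reference (decimal): append the referenced character as-is.
                    out.appendCodePoint(Integer.parseInt(ref.substring(1), 10));
                } else {
                    // Entity reference: recursively apply the same step to the replacement text.
                    normalize(entities.get(ref), entities, out);
                }
                i = end + 1;
            } else if (c == ' ' || c == '\r' || c == '\n' || c == '\t') {
                // Literal white space character: append a single space.
                out.append(' ');
                i++;
            } else {
                // Any other character: append unchanged.
                out.append(c);
                i++;
            }
        }
    }

    // Makes whitespace visible, using the same markers as the test programs in this question.
    static String show(String s) {
        return s.replace("\t", "[TAB]").replace(" ", "[SPACE]")
                .replace("\r", "[CR]").replace("\n", "[NL]");
    }

    public static void main(String[] args) {
        // Replacement text of each entity, assuming character references were expanded
        // when the declarations were read; t2 holds a literal tab.
        Map<String, String> entities = new HashMap<>();
        entities.put("d", "\r");
        entities.put("a", "\n");
        entities.put("t", "\t");
        entities.put("t2", "\t");

        String raw = "CARRIAGE_RETURNS:(&d;&#xD;),NEWLINES:(&a;&#xA;),TABS:(&t;&#x9;&t2;\t)";
        StringBuilder out = new StringBuilder();
        normalize(raw, entities, out);
        System.out.println(show(out.toString()));
        // prints CARRIAGE_RETURNS:([SPACE][CR]),NEWLINES:([SPACE][NL]),TABS:([SPACE][TAB][SPACE][SPACE])
    }
}
```

Under that assumption the recursion sees a real carriage return, newline, or tab inside each entity and replaces it with a space, while the character references written directly in the attribute survive. If the table instead held the literal entity values (for example the five characters `&#xD;`), the same code would preserve the whitespace from the entities.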

My reading of those rules is that the XML parser should produce the following attribute value (interpretation: the same rules apply whether in attribute or entity - character references preserved, literal whitespace characters replaced):

CARRIAGE_RETURNS:([CR][CR]),NEWLINES:([NL][NL]),TABS:([TAB][TAB][SPACE][SPACE])

However, the example given a little bit below that in the spec suggests that the output should be as follows, and a Java test I wrote works in exactly that way (interpretation: if it's an entity value, it's always a replacement):

CARRIAGE_RETURNS:([SPACE][CR]),NEWLINES:([SPACE][NL]),TABS:([SPACE][TAB][SPACE][SPACE])

On the other hand, a test I wrote in PHP outputs this (interpretation: if it's an entity value, it's never a replacement):

CARRIAGE_RETURNS:([CR][CR]),NEWLINES:([NL][NL]),TABS:([TAB][TAB][TAB][SPACE])

Similar output is given by running the XML file through an identity XSLT transform using the xsltproc tool:

<x a="CARRIAGE_RETURNS:(&#13;&#13;),NEWLINES:(&#10;&#10;),TABS:(&#9;&#9;&#9; )"/>

So my question is: what should happen and why?

Sample PHP and Java programs below:

PHP:

<?php
// Library versions from phpinfo():
// DOM/XML API Version  20031129
// libxml Version  2.6.32
$doc = new DOMDocument();
$doc->load("t.xml");
// Make whitespace in the attribute value visible.
echo str_replace(array("\t", " ", "\r", "\n"), array("[TAB]", "[SPACE]", "[CR]", "[NL]"), $doc->documentElement->getAttribute("a")), "\n";

Java:

import java.io.*;
class T{

  public static void main(String[] args) throws Exception {
    String xmlString = readFile(args[0]);
    System.out.println(xmlString);
    org.w3c.dom.Document doc =
      javax.xml.parsers.DocumentBuilderFactory.newInstance().
      newDocumentBuilder().
      parse(new org.xml.sax.InputSource(new StringReader(xmlString)));
    System.out.println(doc.getImplementation());
    System.out.println(
      doc.
      getDocumentElement().
      getAttribute("a").
      replace("\t", "[TAB]").
      replace(" ", "[SPACE]").
      replace("\r", "[CR]").
      replace("\n", "[NL]")
    );
  }

  // Very rough: reads the whole file using the platform default charset.
  private static String readFile(String fileName) throws IOException {
    File file = new File(fileName);
    byte[] buffer = new byte[(int) file.length()];
    // readFully keeps reading until the buffer is full; a single read() may return early.
    try (DataInputStream in = new DataInputStream(new FileInputStream(file))) {
      in.readFully(buffer);
    }
    return new String(buffer);
  }

}

Paul Clapham · Accepted Answer · 2010-01-29T22:14:10

So the question is: is the replacement text of the entity a carriage-return character, or is it the character reference that represents a carriage-return character?

If you look at the examples in Appendix D of the XML Recommendation (especially the one described as "a more complex example"), it appears that the replacement text (in your example) should be a carriage-return character and not the character reference, which means that your "Java test" is the correct one. At least, that's if my interpretation of the appendix is correct.

However, note that Appendix D is non-normative, which means you would have to read the body of the Recommendation to find out the actual rules. I believe that's section 4.4, but that table just made my head hurt.
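
To illustrate the distinction between the literal entity value and the replacement text, here is a small Java sketch. It assumes the rule as I read it in section 4.5 of the Recommendation: character references in the declaration are expanded when the replacement text is built, while general-entity references are left alone. The class name `ReplacementText` and the method `fromLiteral` are made up for the example, and parameter-entity references are not handled.

```java
public class ReplacementText {

    // Builds replacement text from a literal entity value by expanding
    // character references only; everything else is copied unchanged.
    static String fromLiteral(String literal) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < literal.length()) {
            if (literal.startsWith("&#", i)) {
                int end = literal.indexOf(';', i);
                boolean hex = literal.charAt(i + 2) == 'x';
                String digits = literal.substring(i + (hex ? 3 : 2), end);
                out.appendCodePoint(Integer.parseInt(digits, hex ? 16 : 10));
                i = end + 1;
            } else {
                out.append(literal.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // <!ENTITY d "&#xD;"> : literal value is five characters, replacement text is one.
        String d = fromLiteral("&#xD;");
        System.out.println(d.length() + " " + (int) d.charAt(0)); // prints 1 13

        // A general-entity reference stays as written; only the character reference expands.
        System.out.println(fromLiteral("&d;x&#38;")); // prints &d;x&
    }
}
```

On that reading, by the time attribute-value normalization recurses into `&d;` it meets a real carriage-return character, which is white space and so becomes a space.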
