4
votes

I'm using Java 11 (AdoptOpenJDK 11.0.5 2019-10-15) on Windows 10. I'm parsing some legacy XHTML 1.1 files, which take the following general form:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/MarkUp/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
  <title>XHTML 1.1 Skeleton</title>
</head>
<body>
</body>
</html>

I'm using a simple non-validating parser:

DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
documentBuilderFactory.setNamespaceAware(true);
DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
final Document document;
try (InputStream inputStream = new BufferedInputStream(getClass().getResourceAsStream("xhtml-1.1-test.xhtml"))) {
  document = documentBuilder.parse(inputStream);
}

For some reason it's adding extra attributes such as xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" and xml:space="preserve" all over the place:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" version="-//W3C//DTD XHTML 1.1//EN" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:lang="en">
<head xmlns="http://www.w3.org/1999/xhtml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <title xmlns="http://www.w3.org/1999/xhtml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">XHTML 1.1 Skeleton</title>
</head>
<body xmlns="http://www.w3.org/1999/xhtml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:space="preserve"></body>
</html>

I know that DTDs can provide default attribute values, but I don't understand why the xmlns:xsi attribute was added, when there appear to be no elements or attributes in that namespace.

Furthermore xml:space="preserve" seems to be incorrect altogether; only elements like <pre> should have xml:space="preserve" set, I would think. (Update: The HTML5 specification indicates that HTML by default preserves space, and that xml:space must not be serialized in HTML, so maybe that was part of the reasoning here. I will improve my HTML serializer to ignore the xml:space attribute, which will partially mitigate this issue.)

Also note the version="-//W3C//DTD XHTML 1.1//EN" as well; that's something I don't need or want.

Am I doing something wrong? Is there a way I can configure the parser not to include this unnecessary cruft?

Interestingly, this is not a problem with XHTML 1.0 Strict:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>XHTML 1.0 Skeleton</title>
</head>
<body>
</body>
</html>

When parsed that yields what one would expect:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>XHTML 1.0 Skeleton</title>
</head>
<body>
</body>
</html>

But it is also a problem with -//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN, so this seems to affect only the XHTML 1.1 family of DTDs.

Update: I have some potentially helpful news. If I create a new document without a DTD and import the entire document tree into it, all this cruft (which apparently comes from attribute defaults declared in the DTD) goes away, because the destination document doesn't have a DTD at all. (See How to force removal of attributes with implied default values from DTD in Java XML DOM.) But this is very inefficient; it would be nice to turn this off altogether when parsing.
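
For illustration, here is a minimal sketch of that import approach. It relies on the DOM rule that Document.importNode() copies only specified attributes and leaves DTD-defaulted ones behind. The class name DtdFreeCopy, its method names, and the tiny inline sample document (with a made-up defaulted attribute named kind) are assumptions for the example, not part of my actual code.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper; the name and sample data are illustrative only.
final class DtdFreeCopy {

  /** Copies the root element of the source into a new document that has no doctype. */
  static Document copyWithoutDtd(DocumentBuilder builder, Document source) {
    Document target = builder.newDocument();
    // A deep import copies specified attributes only; attributes that exist
    // solely because of a DTD default are not carried over.
    Node importedRoot = target.importNode(source.getDocumentElement(), true);
    target.appendChild(importedRoot);
    return target;
  }

  /** Parses a tiny document whose internal DTD subset declares a default attribute. */
  static Document parseSample(DocumentBuilder builder) throws Exception {
    String xml = "<!DOCTYPE doc [<!ATTLIST doc kind CDATA \"defaulted\">]><doc id=\"1\"/>";
    return builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
  }

  public static void main(String[] args) throws Exception {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    DocumentBuilder builder = factory.newDocumentBuilder();
    Document source = parseSample(builder);
    Document copy = copyWithoutDtd(builder, source);
    System.out.println("source has kind: " + source.getDocumentElement().hasAttribute("kind"));
    System.out.println("copy has kind:   " + copy.getDocumentElement().hasAttribute("kind"));
  }
}
```

The cost is the one noted above: the whole tree is copied a second time, so this is a cleanup step after parsing rather than a parser setting.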

What is DefaultEntityResolver? Also, regarding "For some reason it's adding extra attributes", do you mean that it is added to the original XML file? If not, how do you convert the Document object to an XML string (or save it to a file)? - Eng.Fouad
DefaultEntityResolver is an entity resolver that supplies the entities (such as DTDs) locally instead of downloading them from the Internet. See the ticket globalmentor.atlassian.net/browse/JAVA-175, or clone commit bitbucket.org/globalmentor/… if you want to verify the contents of the entities. - Garret Wilson
When I say, "for some reason it's adding extra attributes", I mean that it's adding them to the DOM tree. I'm using a custom XML serializer I wrote myself to print out the tree. - Garret Wilson
What happens if you remove the DefaultEntityResolver? - Olivier
If I remove the DefaultEntityResolver, then it won't even parse, and crashes because of a java.io.FileNotFoundException for http://www.w3.org/TR/xhtml11/DTD/xhtml-datatypes-1.mod. Apparently the W3C has removed this entity from its website altogether. (So does no one parse XHTML 1.1 files anymore with standard Java XML parsers? This is completely broken.) So I must use a custom entity resolver with locally stored entities or it won't even parse. @Olivier have you tried it yourself? - Garret Wilson

2 Answers

0
votes

I've found a workaround, although it's not ideal. The idea is that when a document asks to be parsed with the XHTML 1.1 DTD (-//W3C//DTD XHTML 1.1//EN), the entity resolver hands the parser the XHTML 1.0 Strict DTD (-//W3C//DTD XHTML 1.0 Strict//EN) instead. For most practical purposes this DTD is almost the same as the one requested, but it doesn't bring in all the default cruft.

Remembering that DefaultEntityResolver is my entity resolver with most of the XHTML DTDs predefined (see Complete list of XHTML, MathML, and SVG modules and other entities, with public identifiers?), the implementation looks something like this:

private static final EntityResolver XHTML_1_1_TO_XHTML_1_0_ENTITY_RESOLVER =
    new EntityResolver() {

  private final EntityResolver defaultEntityResolver = DefaultEntityResolver.getInstance();

  @Override
  public InputSource resolveEntity(final String publicID, final String systemID)
      throws SAXException, IOException {
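    if(XHTML_1_1_PUBLIC_ID.equals(publicID)) {
      // Substitute the XHTML 1.0 Strict DTD; this recursive call falls through to the default resolver below.
      final InputSource inputSource = resolveEntity(XHTML_1_0_STRICT_PUBLIC_ID, systemID);
      // The EntityResolver contract allows a null result, so guard before relabeling the source.
      if(inputSource != null) {
        inputSource.setPublicId(publicID);
      }
      return inputSource;
    }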
    return defaultEntityResolver.resolveEntity(publicID, systemID);
  }

};

Then I would use that entity resolver when parsing:

documentBuilder.setEntityResolver(XHTML_1_1_TO_XHTML_1_0_ENTITY_RESOLVER);

It's somewhat of a kludge, and semantically I don't like it. But for my application I just need a clean, well-formed parsed document with correct entity replacement, so in practice it may produce effectively the same results for most documents.

0
votes

Have you tried setting the Xerces configuration feature http://apache.org/xml/features/nonvalidating/load-dtd-grammar to false?
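
A minimal sketch of how one might try that feature through JAXP follows. The class and method names are hypothetical. I have not verified that the JDK's built-in parser actually stops adding defaulted attributes when the feature is off; a parser may accept the setting and ignore it, or reject it outright, so the sketch only reports whether the factory accepted it.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

// Hypothetical helper; only the feature URI comes from the Xerces documentation.
final class LoadDtdGrammarOption {

  static final String LOAD_DTD_GRAMMAR =
      "http://apache.org/xml/features/nonvalidating/load-dtd-grammar";

  /**
   * Attempts to switch the feature off.
   * Returns true if the factory accepted the setting, false if it rejected it.
   */
  static boolean tryDisableDtdGrammar(DocumentBuilderFactory factory) {
    try {
      factory.setFeature(LOAD_DTD_GRAMMAR, false);
      return true;
    } catch (ParserConfigurationException e) {
      // The underlying parser does not recognize or support the feature.
      return false;
    }
  }

  public static void main(String[] args) {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    System.out.println("feature accepted: " + tryDisableDtdGrammar(factory));
  }
}
```

Even when the setting is accepted, the parsed DOM would still need to be checked for defaulted attributes to see whether it had any effect.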

However, I've just been looking at how I do this in Saxon: I don't ask the XML parser not to report defaulted attributes; rather, I discard them when they are reported. I'm using Xerces as a SAX parser rather than a DOM parser, though. (In SAX, defaulted attributes can be detected with Attributes2.isSpecified(), which returns false for an attribute whose value came from a DTD default.)
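
As a rough illustration of discarding defaulted attributes on the SAX side (this is not Saxon's actual code), the sketch below sorts reported attributes by whether org.xml.sax.ext.Attributes2.isSpecified() says they were written in the document. The class name, the list fields, and the inline sample document are made up for the example, and it assumes the parser passes an Attributes2 instance to the handler.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ext.Attributes2;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler; names are illustrative only.
final class SpecifiedAttributeCollector extends DefaultHandler {

  final List<String> specified = new ArrayList<>();
  final List<String> defaulted = new ArrayList<>();

  @Override
  public void startElement(String uri, String localName, String qName, Attributes attributes) {
    for (int i = 0; i < attributes.getLength(); i++) {
      // If the parser does not offer Attributes2, treat every attribute as specified.
      boolean isSpecified = !(attributes instanceof Attributes2)
          || ((Attributes2) attributes).isSpecified(i);
      String name = qName + "/@" + attributes.getQName(i);
      if (isSpecified) {
        specified.add(name);
      } else {
        defaulted.add(name); // value came from a DTD default; a real handler would drop it
      }
    }
  }

  /** Parses the given XML text and returns the handler holding the sorted attribute names. */
  static SpecifiedAttributeCollector collect(String xml) throws Exception {
    SpecifiedAttributeCollector handler = new SpecifiedAttributeCollector();
    SAXParserFactory.newInstance().newSAXParser()
        .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
    return handler;
  }

  public static void main(String[] args) throws Exception {
    String xml = "<!DOCTYPE doc [<!ATTLIST doc kind CDATA \"defaulted\">]><doc id=\"1\"/>";
    SpecifiedAttributeCollector result = collect(xml);
    System.out.println("specified: " + result.specified);
    System.out.println("defaulted: " + result.defaulted);
  }
}
```

The DOM counterpart of this check is Attr.getSpecified(), which could be used to strip such attributes from an already parsed tree.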