I'm using Java 11 (AdoptOpenJDK 11.0.5 2019-10-15) on Windows 10. I'm parsing some legacy XHTML 1.1 files, which take the following general form:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/MarkUp/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>XHTML 1.1 Skeleton</title>
</head>
<body>
</body>
</html>
I'm using a simple non-validating parser:
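import java.io.BufferedInputStream;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Namespace-aware, non-validating parser; the DTD is still read for defaults and entities.
DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
documentBuilderFactory.setNamespaceAware(true);
DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
final Document document;
try (InputStream inputStream = new BufferedInputStream(getClass().getResourceAsStream("xhtml-1.1-test.xhtml"))) {
    document = documentBuilder.parse(inputStream);
}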
For some reason it's adding extra attributes such as xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" and xml:space="preserve" all over the place:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" version="-//W3C//DTD XHTML 1.1//EN" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:lang="en">
<head xmlns="http://www.w3.org/1999/xhtml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<title xmlns="http://www.w3.org/1999/xhtml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">XHTML 1.1 Skeleton</title>
</head>
<body xmlns="http://www.w3.org/1999/xhtml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:space="preserve"></body>
</html>
I know that DTDs can provide default attribute values, but I don't understand why the xmlns:xsi attribute was added, when there appear to be no elements or attributes in that namespace.
Furthermore, xml:space="preserve" seems to be incorrect altogether; only elements like <pre> should have xml:space="preserve" set, I would think. (Update: The HTML5 specification indicates that HTML by default preserves space, and that xml:space must not be serialized in HTML, so maybe that was part of the reasoning here. I will improve my HTML serializer to ignore the xml:space attribute, which will partially mitigate this issue.)
Also note the version="-//W3C//DTD XHTML 1.1//EN" attribute; that's something I don't need or want.
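The attributes supplied by the DTD can at least be told apart from the ones actually written in the file, because DOM reports Attr.getSpecified() as false for DTD defaults. Below is a minimal sketch of that check, not my real setup: the note document, its inline DTD (standing in for the XHTML 1.1 DTD), and the class and method names are all made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.xml.sax.InputSource;

class DefaultedAttributesDemo {

    // Hypothetical miniature document: the internal DTD subset declares
    // a default value for "kind", which the source never writes out.
    static final String XML =
        "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE note [\n"
        + "  <!ELEMENT note (#PCDATA)>\n"
        + "  <!ATTLIST note kind CDATA \"memo\" id CDATA #IMPLIED>\n"
        + "]>\n"
        + "<note id=\"n1\">hello</note>";

    // Same non-validating, namespace-aware configuration as in the question.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Returns the names of attributes present in the source text,
    // skipping any attribute that only exists as a DTD default.
    static List<String> specifiedAttributeNames(Element element) {
        List<String> names = new ArrayList<>();
        NamedNodeMap attributes = element.getAttributes();
        for (int i = 0; i < attributes.getLength(); i++) {
            Attr attr = (Attr) attributes.item(i);
            if (attr.getSpecified()) {
                names.add(attr.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        Element root = parse(XML).getDocumentElement();
        System.out.println("kind = " + root.getAttribute("kind"));
        System.out.println("kind specified? " + root.getAttributeNode("kind").getSpecified());
        System.out.println("specified attributes: " + specifiedAttributeNames(root));
    }
}
```

A serializer could apply the same getSpecified() test to skip xml:space, version, and similar defaults on output, although that only hides them and does not stop the parser from adding them.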
Am I doing something wrong? Is there a way I can configure the parser not to include this unnecessary cruft?
Interestingly this is not a problem with XHTML 1.0 strict.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>XHTML 1.0 Skeleton</title>
</head>
<body>
</body>
</html>
When parsed that yields what one would expect:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>XHTML 1.0 Skeleton</title>
</head>
<body>
</body>
</html>
But it is a problem with -//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN. So this seems to be just an XHTML 1.1 problem.
Update: I have some potentially helpful news: if I create a new document without a DTD and import the entire document tree into the new document, all this cruft (which apparently comes from default attribute values declared in the DTD) goes away, because the destination document doesn't have a DTD at all. (See "How to force removal of attributes with implied default values from DTD in Java XML DOM".) But this is very inefficient; it would be nice to turn this off altogether when parsing.
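The import workaround can be sketched roughly as follows. This is an illustration under assumptions rather than my actual code: the tiny note document with an inline DTD stands in for the XHTML 1.1 file, and the class and method names are hypothetical. It relies on Document.importNode copying only specified attributes, so defaults that came from the source DTD are left behind.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class ImportWithoutDtdDemo {

    // Hypothetical stand-in document whose DTD supplies a default "kind" attribute.
    static final String XML =
        "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE note [\n"
        + "  <!ELEMENT note (#PCDATA)>\n"
        + "  <!ATTLIST note kind CDATA \"memo\" id CDATA #IMPLIED>\n"
        + "]>\n"
        + "<note id=\"n1\">hello</note>";

    static DocumentBuilder newBuilder() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder();
    }

    // Copies the element tree into a fresh document that has no doctype.
    // Only the document element is copied here; comments or processing
    // instructions outside the root would need to be imported separately.
    static Document copyWithoutDtd(Document source, DocumentBuilder builder) {
        Document target = builder.newDocument();
        Node imported = target.importNode(source.getDocumentElement(), true);
        target.appendChild(imported);
        return target;
    }

    public static void main(String[] args) throws Exception {
        DocumentBuilder builder = newBuilder();
        Document source = builder.parse(new InputSource(new StringReader(XML)));
        Document copy = copyWithoutDtd(source, builder);
        System.out.println("source has kind? " + source.getDocumentElement().hasAttribute("kind"));
        System.out.println("copy has kind? " + copy.getDocumentElement().hasAttribute("kind"));
    }
}
```

The cost is a full deep copy of the tree, which is why a parser-level switch would still be preferable.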
DefaultEntityResolver? Also, regarding "For some reason it's adding extra attributes", do you mean that it is added to the original xml file? If not, how do you convert the Document object to XML string (or saving it to a file)? – Eng.Fouad
DefaultEntityResolver is an entity resolver that supplies the entities (such as DTDs) locally instead of downloading them from the Internet. See the ticket globalmentor.atlassian.net/browse/JAVA-175, or clone commit bitbucket.org/globalmentor/… if you want to verify the contents of the entities. – Garret Wilson
DefaultEntityResolver, then it won't even parse, and crashes because of a java.io.FileNotFoundException for http://www.w3.org/TR/xhtml11/DTD/xhtml-datatypes-1.mod. Apparently the W3C has removed this entity from its website altogether. (So does no one parse XHTML 1.1 files anymore with standard Java XML parsers? This is completely broken.) So I must use a custom entity resolver with locally stored entities or it won't even parse. @Olivier have you tried it yourself? – Garret Wilson
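For readers unfamiliar with the idea, a resolver that serves entities locally might look roughly like the sketch below. This is not the DefaultEntityResolver mentioned in the comments, whose implementation is not shown here; the class name, the example public ID, the unreachable system ID, and the in-memory DTD text are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

class LocalEntityResolverDemo {

    // Hypothetical public ID and DTD text; a real resolver would map the
    // XHTML public IDs to locally stored copies of the W3C files.
    static final String PUBLIC_ID = "-//EXAMPLE//DTD Note 1.0//EN";
    static final String DTD =
        "<!ELEMENT note (#PCDATA)>\n"
        + "<!ATTLIST note kind CDATA \"memo\">\n";
    static final String XML =
        "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE note PUBLIC \"" + PUBLIC_ID + "\" \"http://example.invalid/note.dtd\">\n"
        + "<note>hello</note>";

    // Resolves external entities from an in-memory map keyed by public ID.
    static class LocalEntityResolver implements EntityResolver {
        private final Map<String, String> entitiesByPublicId;

        LocalEntityResolver(Map<String, String> entitiesByPublicId) {
            this.entitiesByPublicId = entitiesByPublicId;
        }

        @Override
        public InputSource resolveEntity(String publicId, String systemId) throws IOException {
            String text = publicId == null ? null : entitiesByPublicId.get(publicId);
            if (text == null) {
                // Failing fast avoids a silent fallback to a network fetch,
                // which is what returning null would allow.
                throw new IOException("No local copy for " + publicId + " (" + systemId + ")");
            }
            InputSource source = new InputSource(new StringReader(text));
            source.setPublicId(publicId);
            source.setSystemId(systemId);
            return source;
        }
    }

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setEntityResolver(new LocalEntityResolver(Map.of(PUBLIC_ID, DTD)));
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document document = parse(XML);
        // The default comes from the locally supplied DTD, without network access.
        System.out.println("kind = " + document.getDocumentElement().getAttribute("kind"));
    }
}
```

Keying on the public ID rather than the system ID means the lookup does not depend on which URL a given file happens to use for the same DTD.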