My manager asked me to explain why I called jdom’s checkCharacterData
before passing my string to an XMLStreamWriter
, so I referred to the XML spec and then got confused.
XML 1.0 and XML 1.1 say that a valid XML character is “tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646.” That sounds stupid: tab, carriage return, and line feed are legal characters of Unicode. Then there’s the comment “any Unicode character, excluding the surrogate blocks, FFFE, and FFFF,” which was modified in XML 1.1 to refer to U+0000 – U+10FFFF excluding U+0000, U+D800 – U+DFFF, and U+FFFE – U+FFFF; note that NUL is excluded. Then there’s the Note that says authors are “discouraged” from using the compatibility characters including some characters that are already excluded by the BNF.
Question: What is/was a legal Unicode character? Is NUL a valid Unicode character? (I found a pdf of ISO 10646 (2nd edition, 2010) which doesn’t seem to exclude U+0000.) Did ISO 10646 or Unicode change between the 2000 edition and the 2010 edition to include control characters that were previously excluded? And as for XML, is there a reason that the text is so lenient/sloppy while the BNF is strict?