XML and Unicode specifications: what’s a legal character?

Question

My manager asked me to explain why I called jdom’s checkCharacterData before passing my string to an XMLStreamWriter, so I referred to the XML spec and then got confused.

XML 1.0 and XML 1.1 say that a valid XML character is “tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646.” That sounds stupid: tab, carriage return, and line feed are legal characters of Unicode. Then there’s the comment “any Unicode character, excluding the surrogate blocks, FFFE, and FFFF,” which was modified in XML 1.1 to refer to U+0000 – U+10FFFF excluding U+0000, U+D800 – U+DFFF, and U+FFFE – U+FFFF; note that NUL is excluded. Then there’s the Note that says authors are “discouraged” from using the compatibility characters including some characters that are already excluded by the BNF.

Question: What is/was a legal Unicode character? Is NUL a valid Unicode character? (I found a pdf of ISO 10646 (2nd edition, 2010) which doesn’t seem to exclude U+0000.) Did ISO 10646 or Unicode change between the 2000 edition and the 2010 edition to include control characters that were previously excluded? And as for XML, is there a reason that the text is so lenient/sloppy while the BNF is strict?

Mike Samuel · Accepted Answer · 2012-03-02T02:00:00

Question: What is/was a legal Unicode character?

The Unicode Glossary defines it thus:

Character. (1) The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape (see also glyph), though in code tables some form of visual representation is essential for the reader’s understanding. (2) Synonym for abstract character. (3) The basic unit of encoding for the Unicode character encoding. (4) The English name for the ideographic written elements of Chinese origin. [See ideograph (2).]

Is NUL a valid Unicode character? (I found a pdf of ISO 10646 (2nd edition, 2010) which doesn’t seem to exclude U+0000.)

NUL (U+0000) is a code point, and it falls under the definition of "abstract character", so it is a character by sense 2 above.

Did ISO 10646 or Unicode change between the 2000 edition and the 2010 edition to include control characters that were previously excluded?

NUL has been a control character since the earliest versions of Unicode. Appendix D of the Unicode Standard lists the changes between versions.

Table D-2 there shows 65 control characters in every version from 1.0 through 3.0, with no change.

Table D-2 documents the number of characters assigned in the different versions of the Unicode standard.
         V1.0 V1.1 V2.0 V2.1 V3.0
...
Controls   65   65   65   65   65
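
As a rough cross-check of that figure, here is a small Java sketch that counts the code points the running JDK puts in the Control general category (Cc) and confirms that NUL is one of them. The class and method names are made up for illustration, and the result reflects whatever Unicode data the JDK ships with, not the Unicode 3.0 tables quoted above.

```java
// Illustrative sketch only: ControlCount and countControls are hypothetical names.
final class ControlCount {
    private ControlCount() {}

    /** Counts the code points whose general category is Control (Cc). */
    static int countControls() {
        int n = 0;
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (Character.getType(cp) == Character.CONTROL) {
                n++;
            }
        }
        return n;
    }

    /** True if U+0000 is a valid code point classified as a control character. */
    static boolean nulIsControl() {
        return Character.isValidCodePoint(0)
                && Character.getType(0) == Character.CONTROL;
    }

    public static void main(String[] args) {
        System.out.println("Cc code points: " + countControls());
        System.out.println("NUL is a control character: " + nulIsControl());
    }
}
```

The Cc category covers U+0000 to U+001F and U+007F to U+009F, which is 32 + 33 = 65 code points, so the count should agree with the table.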

And as for XML, is there a reason that the text is so lenient/sloppy while the BNF is strict?

Writing specifications that are both complete and succinct is hard. When the text disagrees with the BNF, trust the BNF.
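
To make "trust the BNF" concrete, here is a minimal Java sketch of the XML 1.0 Char production (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]). The names XmlChars, isXml10Char and firstIllegalIndex are invented for this example; this is not JDOM's checkCharacterData, only an approximation of the kind of check it performs.

```java
// Illustrative sketch only: class and method names are hypothetical.
final class XmlChars {
    private XmlChars() {}

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Returns the char index of the first code point that is not a legal
     * XML 1.0 character, or -1 if the whole string is legal.
     */
    static int firstIllegalIndex(String s) {
        int i = 0;
        while (i < s.length()) {
            // codePointAt joins a valid surrogate pair into one code point;
            // an unpaired surrogate comes back as itself and is rejected.
            int cp = s.codePointAt(i);
            if (!isXml10Char(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstIllegalIndex("tab\tand newline\n are fine"));
        System.out.println(firstIllegalIndex("a\0b")); // NUL at index 1
    }
}
```

Walking the string by code point rather than by char matters here: a correctly paired surrogate encodes a legal character above U+FFFF, while a lone surrogate falls in the excluded U+D800 to U+DFFF range.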
