2
votes

My manager asked me to explain why I called jdom’s checkCharacterData before passing my string to an XMLStreamWriter, so I referred to the XML spec and then got confused.

XML 1.0 and XML 1.1 say that a valid XML character is “tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646.” That sounds stupid: tab, carriage return, and line feed are legal characters of Unicode. Then there’s the comment “any Unicode character, excluding the surrogate blocks, FFFE, and FFFF,” which was modified in XML 1.1 to refer to U+0000 – U+10FFFF excluding U+0000, U+D800 – U+DFFF, and U+FFFE – U+FFFF; note that NUL is excluded. Then there’s the Note that says authors are “discouraged” from using the compatibility characters including some characters that are already excluded by the BNF.

Question: What is/was a legal Unicode character? Is NUL a valid Unicode character? (I found a pdf of ISO 10646 (2nd edition, 2010) which doesn’t seem to exclude U+0000.) Did ISO 10646 or Unicode change between the 2000 edition and the 2010 edition to include control characters that were previously excluded? And as for XML, is there a reason that the text is so lenient/sloppy while the BNF is strict?

4

4 Answers

3
votes

Question: What is/was a legal Unicode character?

The Unicode Glossary defines it thus:

Character. (1) The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape (see also glyph), though in code tables some form of visual representation is essential for the reader’s understanding. (2) Synonym for abstract character. (3) The basic unit of encoding for the Unicode character encoding. (4) The English name for the ideographic written elements of Chinese origin. [See ideograph (2).]


Is NUL a valid Unicode character? (I found a pdf of ISO 10646 (2nd edition, 2010) which doesn’t seem to exclude U+0000.)

NUL is a codepoint, and it falls under the definition of "abstract character" so it is a character by sense 2 above.


Did ISO 10646 or Unicode change between the 2000 edition and the 2010 edition to include control characters that were previously excluded?

NUL has been a control character from early versions. Appendix D contains a list of changes.

It says in table D.2 that there have been 65 control characters from Version 1 through Version 3 without change.

Table D-2 documents the number of characters assigned in the different versions of the Unicode standard.

         V1.0 V1.1 V2.0 V2.1 V3.0
...
Controls   65   65   65   65   65

And as for XML, is there a reason that the text is so lenient/sloppy while the BNF is strict?

Writing specifications that are both complete and succinct is hard. When the text disagrees with the BNF, trust the BNF.

1
votes

The use of the word “character” is intentionally fuzzy in the Unicode standard, but mostly it is used in a technical sense: a code point designated as an assigned character code point. This does not completely coincide with the intuitive concept of character. For example, the intuitive character that consists of letter i with macron and grave accent does not exist as a code point; in Unicode, it can only be represented as a sequence of two or three code points. As another example, the so-called control characters are not characters in the intuitive sense.

When other standards and specifications refer to “Unicode characters,” they refer to code points designated as assigned character code points. The set of Unicode characters varies by Unicode standard version, since new code points are assigned. Technically, the UnicodeData.txt file (at ftp://ftp.unicode.org/Public/UNIDATA/) indicates which code points are characters.

U+0000, conventionally denoted by NUL, has been a Unicode character since the beginning.

The XML specifications are inexact in many ways as regards to characters, as you have observed. But the essential definition is the BNF production for “Char” and the statement “XML processors MUST accept any character in the range specified for Char.” This means that in XML specifications, the concept of character is broader than Unicode character. The ranges in the production contain unassigned code points, actually a huge number of them.

The comment to the “Char” production in XML specifications is best ignored. It is very confusing and even incorrect. The “Char” production simply refers to a set of Unicode code points (different sets in different versions of XML). The set includes code points that you should never use in character data, as well as code points that should be avoided for various reasons. But such rules are at a level different from the formal rules of XML and requirements on XML implementations.

When selecting or writing a routine for checking character data, it depends on the application and purpose what should be accepted and what should be done with code points that fail the test. Even surrogate code points might be processed in some way instead of being just discarded; they may well appear due to confusions with encodings (or e.g. when a Java string has been naively taken as a string of Unicode characters – it is as such just a sequence of 16-bit code units).

1
votes

I would ignore the verbage and just focus on the definitions:

XML 1.0:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Document authors are encouraged to avoid "compatibility characters", as defined in section 2.3 of [Unicode]. The characters defined in the following ranges are also discouraged. They are either control characters or permanently undefined Unicode characters:

[#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDEF], [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF], [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF], [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF], [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF], [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF], [#x10FFFE-#x10FFFF].

XML 1.1:

Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]

Document authors are encouraged to avoid "compatibility characters", as defined in Unicode [Unicode]. The characters defined in the following ranges are also discouraged. They are either control characters or permanently undefined Unicode characters:

[#x1-#x8], [#xB-#xC], [#xE-#x1F], [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF], [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF], [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF], [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF], [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF], [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF], [#x10FFFE-#x10FFFF].

0
votes

It sounds stupid because it is stupid. The First Edition of XML (1998) read "the legal graphic characters of Unicode." For whatever reason, the word "graphic" was removed from the Second Edition of 2000, perhaps because it is inaccurate: XML allows many characters that are not graphic characters.

The definition in the Char production is indeed the right place to look.