3
votes

I understand that ES6 will have a new function, String.prototype.normalize, that will do Unicode normalization of a string (using the 'NFC' form, for example).

Reading http://www.unicode.org/faq/normalization.html, I saw this FAQ entry:

Q: What is the difference between W3C normalization and Unicode normalization?

A: Unicode normalization comes in 4 flavors: C, D, KC, KD. It is C that is relevant for W3C normalization. W3C normalization also treats character references (&#nnnn;) as equivalent to characters. For example, the text string "a&#xnnnn;" (where nnnn = "0301") is Unicode-normalized since it consists only of ASCII characters, but it is not W3C-normalized, since it contains a representation of a combining acute accent with "a", and in normalization form C, that should have been normalized to U+00E1.

Does that mean we will need to replace all occurrences of &#xnnnn; with the characters they represent before calling normalize('NFC')?

Or will there be some sort of normalize('w3c') that treats a letter followed by an accent written as the ASCII reference "&#xnnnn;" as equivalent to its normalized form?

1 Answer

1
votes

By the time your JavaScript runs, the &...; references are already gone if you are working with the DOM, because the HTML parser has decoded them. The only time you would see them is if you somehow download raw HTML as text. And, anyway, converting &...; to the proper character is un-escaping, not normalization. So you would have to un-escape first, then normalize.
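
For the raw-text case, here is a minimal sketch of "un-escape, then normalize". The helper names decodeNumericReferences and unescapeThenNormalize are made up for illustration, and the sketch only handles numeric references (&#nnnn; and &#xhhhh;); named entities such as &amp; are left alone and would need a real HTML decoder. Note that the form argument is case-sensitive, so it has to be 'NFC', not 'nfc'.

```javascript
// Hypothetical helper (not a built-in): decodes only numeric character
// references, decimal (&#769;) or hexadecimal (&#x0301;).
function decodeNumericReferences(text) {
  return text.replace(/&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/g, (match, hex, dec) => {
    const codePoint = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
    // Leave out-of-range values and lone surrogates untouched rather than guess.
    if (codePoint > 0x10FFFF || (codePoint >= 0xD800 && codePoint <= 0xDFFF)) {
      return match;
    }
    return String.fromCodePoint(codePoint);
  });
}

// Hypothetical helper: un-escape first, then apply Unicode NFC normalization.
function unescapeThenNormalize(text) {
  return decodeNumericReferences(text).normalize("NFC");
}

// The FAQ's example: "a" followed by an escaped combining acute accent.
const raw = "a&#x0301;";
console.log(raw.normalize("NFC") === raw);             // true: pure ASCII is already NFC
console.log(unescapeThenNormalize(raw) === "\u00E1");  // true: composed into U+00E1
```

Calling normalize on the raw string changes nothing, which is the FAQ's point: the escaped text is Unicode-normalized but not W3C-normalized until the reference is decoded.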