How do I extract Unicode normalization tables from the XML Unicode Character Database?

Question

I'm writing a script that generates tables of Unicode characters for case folding and similar operations.

I was able to extract those tables just fine, but I'm struggling to figure out which properties to use to get codepoints for normalization.

In Unicode Standard Annex #44, the closest property group I can find is NF(C|D|KC|KD)_QC, which only tells you whether a string has already been normalized, and it still doesn't list the values I need to actually build the tables.

What am I doing wrong here?

Edit: I'm writing a C library to handle Unicode. This isn't a simple one-and-done, write-it-in-Python problem; I'm trying to write my own normalization (technically composition/decomposition) functions.

Edit 2: The decomposition property is "dm", but what about composition and the compatibility ("K") variants?

You should familiarize yourself thoroughly with UAX #15. It not only answers these questions, but covers why using the UCD alone is insufficient. — user3942918
The Unicode Standard (in one of the first chapters) also gives the rules and the algorithm for normalization, including how to handle "future" characters. — Giacomo Catenazzi

nwellnhof · Accepted Answer · 2018-04-24T09:16:43

The Unicode XML database in the ucdxml directory is not authoritative. I'd suggest working with the authoritative files in the ucd directory instead. You'll need:

the fields Decomposition_Type and Decomposition_Mapping from column 5 of UnicodeData.txt (columns are semicolon-separated and numbered from 0),
the field Canonical_Combining_Class from column 3 of the same file, and
the composition exclusions from CompositionExclusions.txt.
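
As a rough illustration of pulling those columns out of UnicodeData.txt in C, here is a minimal sketch. The names `ucd_record`, `ucd_parse_line`, `field_start` and `MAX_DECOMP` are made up for this example, and the bound of 18 code points per mapping is an assumption. It parses a single line and does not deal with the `First`/`Last` range entries or with Hangul syllables, whose decompositions are computed arithmetically rather than listed.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_DECOMP 18  /* assumed upper bound on mapping length */

typedef struct {
    uint32_t code;               /* column 0: code point */
    uint8_t  ccc;                /* column 3: Canonical_Combining_Class */
    int      is_compat;          /* 1 if column 5 starts with a <tag> */
    size_t   decomp_len;         /* 0 if the character has no mapping */
    uint32_t decomp[MAX_DECOMP]; /* column 5: Decomposition_Mapping */
} ucd_record;

/* Return a pointer to the start of column `index` (0-based), or NULL.
 * strtok() is avoided because it would skip the empty columns. */
static const char *field_start(const char *line, int index)
{
    while (index-- > 0) {
        line = strchr(line, ';');
        if (line == NULL)
            return NULL;
        line++;
    }
    return line;
}

/* Parse one line of UnicodeData.txt. Returns 0 on success, -1 on error. */
int ucd_parse_line(const char *line, ucd_record *out)
{
    const char *ccc = field_start(line, 3);
    const char *dec = field_start(line, 5);

    if (ccc == NULL || dec == NULL)
        return -1;

    memset(out, 0, sizeof *out);
    out->code = (uint32_t)strtoul(line, NULL, 16);
    out->ccc  = (uint8_t)strtoul(ccc, NULL, 10);

    /* A leading <tag> marks a compatibility mapping; skip past it. */
    if (*dec == '<') {
        out->is_compat = 1;
        dec = strchr(dec, '>');
        if (dec == NULL)
            return -1;
        dec++;
    }

    /* The rest of the column is a space-separated list of hex code points. */
    while (*dec != ';' && *dec != '\0') {
        char *end;
        unsigned long cp = strtoul(dec, &end, 16);
        if (end == dec)
            break;              /* no more digits in this column */
        if (out->decomp_len == MAX_DECOMP)
            return -1;          /* longer than the assumed bound */
        out->decomp[out->decomp_len++] = (uint32_t)cp;
        dec = end;
    }
    return 0;
}
```

Calling `ucd_parse_line` on the entry for U+00C5 should yield `ccc == 0`, `is_compat == 0` and the two-element mapping `0041 030A`; the entry for U+00A8 carries `<compat>`, so `is_compat` is set.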

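If the decomposition field starts with a type in angle brackets (for example <compat>), it's a compatibility mapping, which applies only to NFKD and NFKC; otherwise it's a canonical mapping. Composition is defined in terms of the canonical decomposition mappings: a character whose canonical mapping is a pair of code points, and which is not excluded from composition, is what that pair composes back to. See section 3.11, Normalization Forms, of the Unicode Standard and UAX #15 for details.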
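
To make "composition is defined in terms of decomposition mappings" concrete, here is a hedged sketch of deriving a composition-pair table from already-parsed mappings. The types `mapping` and `comp_pair` and the function `build_comp_pairs` are hypothetical names for this example. It assumes the exclusion list was loaded from CompositionExclusions.txt, and it additionally skips singletons and mappings whose first character is a non-starter, which that file describes but does not list as active entries. Hangul syllables compose arithmetically and are not covered. Check the details against UAX #15 before relying on it.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t code;      /* character being decomposed */
    int      is_compat; /* 1 if the mapping carried a <tag> */
    size_t   len;       /* number of code points in the mapping */
    uint32_t map[2];    /* canonical mappings have at most two code points */
    uint8_t  first_ccc; /* Canonical_Combining_Class of map[0] */
} mapping;

typedef struct {
    uint32_t first, second, composite;
} comp_pair;

/* Linear search; a real table would likely use a sorted array or a hash. */
static int contains(const uint32_t *set, size_t n, uint32_t cp)
{
    for (size_t i = 0; i < n; i++)
        if (set[i] == cp)
            return 1;
    return 0;
}

/* Fill `out` (room for n_maps entries) and return the number of pairs. */
size_t build_comp_pairs(const mapping *maps, size_t n_maps,
                        const uint32_t *excluded, size_t n_excluded,
                        comp_pair *out)
{
    size_t n = 0;

    for (size_t i = 0; i < n_maps; i++) {
        const mapping *m = &maps[i];

        if (m->is_compat)
            continue;   /* compatibility mappings never recompose */
        if (m->len != 2)
            continue;   /* no mapping, or a singleton */
        if (m->first_ccc != 0)
            continue;   /* non-starter decomposition */
        if (contains(excluded, n_excluded, m->code))
            continue;   /* listed in CompositionExclusions.txt */

        out[n].first     = m->map[0];
        out[n].second    = m->map[1];
        out[n].composite = m->code;
        n++;
    }
    return n;
}
```

With the mappings for U+00C5, U+00A8, U+212B, U+0958 and U+0344 as input and U+0958 in the exclusion list, only the pair (0041, 030A) → 00C5 should survive.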
