1
votes

I'm writing a script to create tables of Unicode characters for case folding, etc.

I was able to extract those tables just fine, but I'm struggling to figure out which properties to use to get codepoints for normalization.

In Unicode Standard Annex #44, the closest property group I can find is NF(C|D|KC|KD)_QC, which only tells you whether a string has already been normalized, and it still doesn't list the values I need to actually build the tables.

What am I doing wrong here?

Edit: I'm writing a C library to handle Unicode. This isn't a simple one-and-done, write-it-in-Python problem; I'm trying to write my own normalization (technically composition/decomposition) functions.

Edit2: The decomposition property is "dm", but what about composition, and the compatibility ("K") variants?

1
You should familiarize yourself thoroughly with UAX #15. It not only answers these questions, but covers why using the UCD alone is insufficient. - user3942918
Also, the Unicode Standard (in one of the first chapters) gives the rules and algorithm for normalization (and how to handle "future" characters). - Giacomo Catenazzi

1 Answer

2
votes

The Unicode XML database in the ucdxml directory is not authoritative. I'd suggest working with the authoritative files in the ucd directory instead. You'll need

If there's a decomposition type in angle brackets, it's a compatibility mapping (used only by NFKD and NFKC); otherwise it's a canonical mapping. Composition is defined in terms of decomposition mappings. See section 3.11, Normalization Forms, of the Unicode Standard and UAX #15 for details.
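
Since you're working in C, here is a minimal sketch of the angle-bracket rule. It assumes you have already split a line of UnicodeData.txt on semicolons and are holding the decomposition field (the sixth field) as a NUL-terminated string. The names `decomp_t`, `parse_decomp_field` and `MAX_DECOMP` are made up for this example, and the array bound is just a generous guess, not a value taken from the standard.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical upper bound for this sketch; adjust it to the data you load. */
#define MAX_DECOMP 32

typedef struct {
    int      is_compat;        /* 1 if the field started with a <tag> */
    size_t   len;              /* number of code points in the mapping */
    uint32_t cps[MAX_DECOMP];  /* the mapping itself */
} decomp_t;

/*
 * Parse one decomposition field, e.g. "0041 030A" or "<compat> 0020 0308".
 * Returns 0 on success (an empty field gives len == 0), -1 on malformed input.
 */
int parse_decomp_field(const char *field, decomp_t *out)
{
    const char *p = field;

    out->is_compat = 0;
    out->len = 0;

    /* A leading <tag> marks a compatibility mapping; skip past it. */
    if (*p == '<') {
        const char *close = strchr(p, '>');
        if (close == NULL)
            return -1;
        out->is_compat = 1;
        p = close + 1;
    }

    /* The rest is a space-separated list of hexadecimal code points. */
    for (;;) {
        char *end;
        unsigned long cp;

        while (*p == ' ')
            p++;
        if (*p == '\0')
            break;

        cp = strtoul(p, &end, 16);
        if (end == p || cp > 0x10FFFFUL || out->len == MAX_DECOMP)
            return -1;

        out->cps[out->len++] = (uint32_t)cp;
        p = end;
    }
    return 0;
}
```

For example, the field for U+00C5 is `0041 030A`, which parses as a canonical mapping of two code points, while the field for U+00A8 is `<compat> 0020 0308`, which parses as a compatibility mapping. This only reads the raw one-level mappings; full decomposition, canonical reordering and composition still have to be built on top as described in UAX #15.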