17
votes

If I accept full Unicode for passwords, how should I normalize the string before passing it to the hash function?

Goals

Without normalization, if someone sets their password to "mañana" (ma\u00F1ana) on one computer and tries to log in with "mañana" (ma\u006E\u0303ana) on another computer, the hashes will be different and the login will fail. Which of the two forms gets sent is determined by the user agent or its operating system.

  • I'd like to ensure that those hash to the same thing.
  • I am not concerned about homoglyphs such as Α, А, and A (Greek, Cyrillic, Latin).
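
To make the goal concrete, here is a minimal Python sketch using the standard library's `unicodedata` module. The variable names are illustrative only; the two strings are the precomposed and decomposed spellings from the example above.

```python
import unicodedata

precomposed = "ma\u00F1ana"        # ñ as a single code point
decomposed = "ma\u006E\u0303ana"   # n followed by a combining tilde

# The strings render identically but are different code point sequences,
# so their UTF-8 bytes (and therefore any hash of them) differ.
assert precomposed != decomposed
assert precomposed.encode("utf-8") != decomposed.encode("utf-8")

# After canonical normalization (NFC here) both spellings are identical.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
assert nfc_a == nfc_b
```

Any of the four normalization forms would make these two inputs equal; they differ in which other strings they also merge.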

Reference

Unicode normalization forms: http://unicode.org/reports/tr15/#Norm_Forms

Considerations

  • Any normalization procedure may cause collisions, e.g. "oﬃce" (o\uFB03ce) == "office".
  • Normalization can change the number of bytes in the string.
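
Both considerations can be checked directly. The sketch below assumes the collision example refers to the "ffi" ligature (U+FB03), which merges with plain "ffi" under the compatibility forms but not under the canonical ones, and it reuses the "mañana" example to show a change in byte length.

```python
import unicodedata

ligature = "o\uFB03ce"   # "o" + LATIN SMALL LIGATURE FFI + "ce"
plain = "office"

# Canonical normalization keeps the ligature distinct from "ffi" ...
assert unicodedata.normalize("NFC", ligature) != plain
# ... while compatibility normalization folds it, creating a collision.
assert unicodedata.normalize("NFKC", ligature) == plain

# Normalization can also change the encoded length of a string.
composed = "ma\u00F1ana"
decomposed_form = unicodedata.normalize("NFD", composed)
composed_len = len(composed.encode("utf-8"))
decomposed_len = len(decomposed_form.encode("utf-8"))
assert decomposed_len > composed_len
```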

Further questions

  • What happens if the server receives a byte sequence that is not valid UTF-8 (or other format)? Reject, since it can't be normalized?
  • What happens if the server receives characters that are unassigned in its version of Unicode?
1
Are you primarily concerned with users using different input methods on different devices? Your example includes ligatures, but what about zero-width joiners and combiners? What about similar but semantically distinct code points like I (Latin Letter) vs Ⅰ (Roman Numeral) vs Ｉ (CJK Full-width)? – Mike Samuel
I'm not concerned about homoglyphs -- it's unlikely they'll be able to type their entire password using an input method that only shares some (near-)glyphs -- but I'll have to think about joiners. It may be that preparing Unicode for password hashing needs a much more thorough approach. – treat your mods well

1 Answer

12
votes

Normalization is undefined for malformed input, such as alleged UTF-8 text that contains illegal byte sequences. Different environments may handle illegal bytes differently: by rejection, replacement, or omission.

Recommendation #1: If possible, reject inputs that do not conform to the expected encoding. (This may be out of the application's control, however.)
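
As a sketch of Recommendation #1, assuming the server receives the password as raw bytes that are expected to be UTF-8, Python's strict decoder can do the rejection. The function name `decode_password` is hypothetical.

```python
def decode_password(raw: bytes) -> str:
    """Decode raw password bytes as UTF-8, rejecting malformed input."""
    try:
        # errors="strict" raises instead of replacing or dropping bad bytes.
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError("password is not valid UTF-8") from exc
```

Using the `"replace"` or `"ignore"` error handlers instead would silently map different byte sequences to the same string, which is the ambiguity the recommendation is meant to avoid.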

Unicode Standard Annex #15 guarantees normalization stability when the input contains only assigned characters:

11.1 Stability of Normalized Forms

For all versions, even prior to Unicode 4.1, the following policy is followed:

A normalized string is guaranteed to be stable; that is, once normalized, a string is normalized according to all future versions of Unicode.

More precisely, if a string has been normalized according to a particular version of Unicode and contains only characters allocated in that version, it will qualify as normalized according to any future version of Unicode.

Recommendation #2: Whichever normalization form is used must use the Normalization Process for Stabilized Strings, i.e., reject any password inputs that contain unassigned characters, since their normalization is not guaranteed stable under server upgrades.
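
One possible way to approximate this check in Python is to reject any code point whose general category is `Cn` according to the interpreter's bundled Unicode database. This is a sketch, not a full implementation of the process described in the annex: `has_unassigned` is a hypothetical helper, and `Cn` also covers noncharacters such as U+FFFF.

```python
import unicodedata

def has_unassigned(text: str) -> bool:
    """Return True if any code point is unassigned (category Cn) in the
    Unicode version shipped with this Python build."""
    return any(unicodedata.category(ch) == "Cn" for ch in text)

# The Unicode version this check is relative to, e.g. for logging.
database_version = unicodedata.unidata_version
```

Because the result depends on `unicodedata.unidata_version`, a password rejected today could be accepted after an interpreter upgrade, but an accepted password keeps its normalized form.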

The compatibility normalization forms seem to handle Japanese better, collapsing several decompositions into the same output where the canonical forms do not.
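
One case where this shows up is half-width katakana, used here as an assumed illustration of the point above. The half-width sequence for "ga" stays distinct from the ordinary katakana letter under NFC but merges with it under NFKC.

```python
import unicodedata

halfwidth_ga = "\uFF76\uFF9E"  # half-width KA + half-width voiced sound mark
katakana_ga = "\u30AC"         # KATAKANA LETTER GA

# Canonical forms leave the half-width sequence untouched.
assert unicodedata.normalize("NFC", halfwidth_ga) != katakana_ga
# Compatibility forms map both spellings to the same output.
assert unicodedata.normalize("NFKC", halfwidth_ga) == katakana_ga
```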

The spec warns:

Normalization Forms KC and KD must not be blindly applied to arbitrary text. Because they erase many formatting distinctions, they will prevent round-trip conversion to and from many legacy character sets, and unless supplanted by formatting markup, they may remove distinctions that are important to the semantics of the text.

However, semantics and round-tripping are not of concern here.

Recommendation #3: Apply NFKC or NFKD before hashing.
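
A minimal sketch of Recommendation #3 in Python follows. It assumes the input has already been decoded and checked as described in the earlier recommendations. The choice of PBKDF2 via `hashlib.pbkdf2_hmac`, the iteration count, and the function name `hash_password` are assumptions made for illustration; the answer does not prescribe a particular hash function.

```python
import hashlib
import unicodedata

def hash_password(password: str, salt: bytes, iterations: int = 600_000) -> bytes:
    """Normalize to NFKC, encode as UTF-8, then derive a key with PBKDF2."""
    normalized = unicodedata.normalize("NFKC", password)
    return hashlib.pbkdf2_hmac(
        "sha256", normalized.encode("utf-8"), salt, iterations
    )
```

With this in place, the precomposed and decomposed spellings of "mañana" produce the same hash for a given salt, and so do "oﬃce" and "office", which is the collision trade-off noted in the question.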