How do I properly implement Unicode passwords?

Question

Adding support for Unicode passwords is an important feature that developers should not ignore.

Still, supporting Unicode in passwords is tricky, because the same text can be represented by different code point sequences in Unicode, and you don't want to prevent people from logging in because of this.
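To make the problem concrete, here is a small sketch (the variable names are only illustrative) showing that two strings that render as the same "é" differ both as code point sequences and as UTF-8 bytes until they are normalized:

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point (LATIN SMALL LETTER E WITH ACUTE)
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# The two strings look identical on screen but do not compare equal.
print(precomposed == decomposed)        # False

# Their UTF-8 encodings differ too, so any hash of the bytes would differ.
print(precomposed.encode("utf-8"))      # b'\xc3\xa9'
print(decomposed.encode("utf-8"))       # b'e\xcc\x81'

# After normalizing both to the same form, they compare equal.
same = unicodedata.normalize("NFC", decomposed) == unicodedata.normalize("NFC", precomposed)
print(same)                             # True
```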

Let's say that you'll store the passwords as UTF-8. Note that this question is not about Unicode encodings; it is about Unicode normalization.

Now the question is: how should you normalize the Unicode data?

You have to be sure that you'll be able to compare it, and that the release of a new version of the Unicode standard will not invalidate your password verification.

Note: there are still some places where Unicode passwords will probably never be used, but this question is not about why or when to use Unicode passwords; it is about how to implement them properly.

1st update

Is it possible to implement this without using ICU, for example by using the OS to do the normalization?

What difference does it make when another unicode standard is released? You've made the decision to store the password in UTF-8 - so store the password in UTF-8. Committees can release new standards without you being forced to change the way you store your data. — Dominic Rodger
Unicode does not dictate encoding. It's just a list of characters that each has a number associated with it (basically). If you choose UTF-8 I don't see how this encoding can change in the future in a way that breaks compatibility. — Assaf Lavie
There are multiple ways of encoding the same visual characters, I am assuming that this is what he wants to know how to cope with. — Lasse V. Karlsen
Maybe I wasn't clear enough: this is not about Unicode encodings, it's about normalization of Unicode text, a process that is required in order to be able to compare the strings. I modified the question to clarify this. — sorin

D.Shawley · Accepted Answer · 2010-05-09T19:34:14

A good start is to read Unicode TR 15: Unicode Normalization Forms. Then you realize that it is a lot of work and prone to strange errors - you probably already know this part since you are asking here. Finally, you download something like ICU and let it do it for you.

IIRC, it is a multistep process. First you decompose the sequence until you cannot decompose it any further; for example, é (U+00E9) would become e followed by the combining acute accent (U+0301). Then you reorder the combining marks into a well-defined (canonical) ordering. Finally, you encode the resulting code point sequence as UTF-8 or something similar. The UTF-8 byte stream can be fed into the cryptographic hash algorithm of your choice and stored in a persistent store. When you want to check whether a password matches, perform the same procedure and compare the output of the hash algorithm with what is stored in the database.
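The procedure can be sketched with Python's standard library instead of ICU. This is only an illustration under some assumptions: `unicodedata.normalize("NFD", ...)` is used for the decompose-and-reorder step, PBKDF2-HMAC-SHA256 with a random salt stands in for "the cryptographic hash algorithm of your choice", and the function names and the iteration count are hypothetical choices, not something the answer prescribes.

```python
import hashlib
import hmac
import os
import unicodedata

ITERATIONS = 100_000  # assumed work factor, chosen only for illustration


def normalize_password(password):
    """Decompose and canonically reorder (NFD), then encode as UTF-8."""
    return unicodedata.normalize("NFD", password).encode("utf-8")


def hash_password(password, salt=None):
    """Return (salt, digest) for the normalized password."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac(
        "sha256", normalize_password(password), salt, ITERATIONS
    )
    return salt, digest


def verify_password(candidate, salt, stored_digest):
    """Repeat the same procedure and compare digests in constant time."""
    _, digest = hash_password(candidate, salt)
    return hmac.compare_digest(digest, stored_digest)


# Store using the precomposed spelling, log in with the decomposed one.
salt, stored = hash_password("caf\u00e9")
print(verify_password("cafe\u0301", salt, stored))  # True
print(verify_password("cafe", salt, stored))        # False
```

Because both spellings normalize to the same code point sequence before hashing, the user can log in regardless of which form their keyboard or OS produces.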
