Consider the following IPython session:
In [1]: s = u'華袞與緼\U00026177同歸'
In [2]: len(s)
Out[2]: 8
The correct output should have been 7, but because the fifth of these seven Chinese characters has a high Unicode code point (above U+FFFF), it is represented in UTF-16, which this Python build uses internally, by a "surrogate pair" of two code units rather than a single one, and as a result Python counts it as two characters rather than one.
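To make the discrepancy concrete, here is a minimal sketch that lists the UTF-16 code units of the string and counts code points by skipping the trailing half of each surrogate pair. The helper names utf16_code_units and count_code_points are my own, not part of any library, and the sketch assumes the string contains no unpaired surrogates.

```python
import struct

def utf16_code_units(text):
    """Return the UTF-16 code units of text as a tuple of ints (hypothetical helper)."""
    data = text.encode('utf-16-le')
    # Each UTF-16 code unit is 2 bytes, little-endian here.
    return struct.unpack('<%dH' % (len(data) // 2), data)

def count_code_points(text):
    """Count code points by ignoring low (trailing) surrogates, 0xDC00-0xDFFF."""
    return sum(1 for unit in utf16_code_units(text)
               if not 0xDC00 <= unit <= 0xDFFF)

s = u'\u83ef\u889e\u8207\u7dfc\U00026177\u540c\u6b78'
units = utf16_code_units(s)
print(len(units))            # 8 code units
print(count_code_points(s))  # 7 code points
```

The fifth character, U+26177, shows up as the two code units 0xD858 and 0xDD77, which is why a len() that counts code units reports 8.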
Even if I use unicodedata.normalize(), whose output shows the surrogate pair as a single code point (\U00026177), the wrong length is still returned when the result is passed to len():
In [3]: import unicodedata
In [4]: unicodedata.normalize('NFC', s)
Out[4]: u'\u83ef\u889e\u8207\u7dfc\U00026177\u540c\u6b78'
In [5]: len(unicodedata.normalize('NFC', s))
Out[5]: 8
Without taking drastic steps like recompiling Python for UTF-32, is there a simple way to get the correct length in situations like this?
I'm on IPython 0.13, Python 2.7.2, Mac OS 10.8.2.