14
votes

Consider the following exchange on IPython:

In [1]: s = u'華袞與緼\U00026177同歸'

In [2]: len(s)
Out[2]: 8

The correct output should have been 7, but because the fifth of these seven Chinese characters has a high Unicode code-point, it is represented in UTF-8 by a "surrogate pair", rather than just one simple codepoint, and as a result Python thinks it is two characters rather than one.

Even if I use unicodedata, which displays the surrogate pair correctly as a single codepoint (\U00026177), len() still returns the wrong length:

In [3]: import unicodedata

In [4]: unicodedata.normalize('NFC', s)
Out[4]: u'\u83ef\u889e\u8207\u7dfc\U00026177\u540c\u6b78'


In [5]: len(unicodedata.normalize('NFC', s))
Out[5]: 8

Without taking drastic steps like recompiling Python for UTF-32, is there a simple way to get the correct length in situations like this?

I'm on IPython 0.13, Python 2.7.2, Mac OS 10.8.2.
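For reference, the behavior the question describes can be reproduced directly by writing the two surrogates by hand (a sketch; \ud858\udd77 is the UTF-16 surrogate-pair encoding of U+26177, which is how a narrow build stores that character internally):

```python
# U+26177 written as the surrogate pair a narrow build stores it as.
# \ud858\udd77 is the UTF-16 encoding of U+26177.
pair = u'\ud858\udd77'
print(len(pair))  # 2, even though this is one character
```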

3
The discussions here and here seem relevant. – DSM
@DSM: Thanks for digging these up. Your first link shows Python compiled for UTF-32 ("wide build"), something I ruled out in my question. In the second, the reply by wberry shows an elaborate piece of code to actually count true characters. My default workaround is like the latter, but I am hoping there exists something built in and more direct. – brannerchinese
I can't reproduce your result here (Ubuntu box, Python 2.7.2). For the unicode u'\u83ef\u889e\u8207\u7dfc\U00026177\u540c\u6b78' I get a length of seven with both len(s) and len(unicodedata.normalize('NFC', s)). – Vicent
It's probably highly version-dependent. Python 3.3 should deal more gracefully with this, since, by default, it never creates surrogate pairs (even though you can create them by hand). – Bakuriu
It isn't UTF-8 that represents the non-BMP character by a surrogate pair. It is UTF-16, or rather the hack that Python used in versions < 3.3 on narrow builds. (Well, you could take the surrogate pairs as in UTF-16, and encode each of the two surrogates using UTF-8, but this is explicitly prohibited by RFC 3629, though many UTF-8 implementations do it: it's called WTF-8. But the only way a string can get encoded in UTF-8 this way is if it originally came from UTF-16.) See chrispy's answer below for a simple solution. – ShreevatsaR
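The RFC 3629 prohibition the comment mentions is visible in practice: Python 3's strict UTF-8 codec refuses to encode a lone surrogate, whereas Python 2's narrow-build codec was lenient (a small sketch, shown under Python 3 semantics for illustration):

```python
# Python 3's strict UTF-8 codec rejects lone surrogates (RFC 3629);
# Python 2's narrow-build codec was lenient and would encode them.
try:
    u'\ud858'.encode('utf-8')
    rejected = False
except UnicodeEncodeError:
    rejected = True
print(rejected)
```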

3 Answers

8
votes
7
votes

I made a function to do this on Python 2:

import re

SURROGATE_PAIR = re.compile(u'[\ud800-\udbff][\udc00-\udfff]', re.UNICODE)

def unicodeLen(s):
    # Each surrogate pair is collapsed to one placeholder before counting.
    return len(SURROGATE_PAIR.sub(u'.', s))

By replacing surrogate pairs with a single character, we 'fix' the len function. On normal strings, this should be pretty efficient: since the pattern won't match, the original string will be returned without modification. It should work on wide (32-bit) Python builds, too, as the surrogate pair encoding will not be used.
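For example, applied to the question's string with U+26177 written out as its surrogate pair \ud858\udd77 (a self-contained sketch; it also runs on Python 3, where lone surrogates can be written literally):

```python
import re

# A high surrogate followed by a low surrogate: one astral character
# as stored on a narrow build.
SURROGATE_PAIR = re.compile(u'[\ud800-\udbff][\udc00-\udfff]', re.UNICODE)

def unicodeLen(s):
    # Collapse each surrogate pair to a single placeholder, then count.
    return len(SURROGATE_PAIR.sub(u'.', s))

# \ud858\udd77 is the surrogate-pair form of U+26177 from the question.
s = u'\u83ef\u889e\u8207\u7dfc\ud858\udd77\u540c\u6b78'
print(len(s))         # 8: the pair counts twice
print(unicodeLen(s))  # 7: one per code point
```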

3
votes

You can override the len function in Python (see: How does len work?) and add an if statement to it that checks for unicode strings containing surrogate pairs.
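A minimal sketch of that suggestion (all names here are illustrative, and shadowing a builtin like this is generally discouraged; the version check lets the same code run on Python 2 and 3):

```python
import re
import sys

# A high surrogate followed by a low surrogate (narrow-build encoding).
_PAIR = re.compile(u'[\ud800-\udbff][\udc00-\udfff]', re.UNICODE)
_text = unicode if sys.version_info[0] == 2 else str
_builtin_len = len

def len(obj):  # shadows the builtin in this module only
    # Count each surrogate pair once for text; defer to the builtin otherwise.
    if isinstance(obj, _text):
        return _builtin_len(_PAIR.sub(u'.', obj))
    return _builtin_len(obj)
```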