14
votes

Consider the following exchange on IPython:

In [1]: s = u'華袞與緼\U00026177同歸'

In [2]: len(s)
Out[2]: 8

The correct output should have been 7, but because the fifth of these seven Chinese characters has a high Unicode code-point, it is represented in UTF-8 by a "surrogate pair", rather than just one simple codepoint, and as a result Python thinks it is two characters rather than one.

Even if I use unicodedata, which displays the surrogate pair correctly as a single codepoint (\U00026177), len() still returns the wrong length:

In [3]: import unicodedata

In [4]: unicodedata.normalize('NFC', s)
Out[4]: u'\u83ef\u889e\u8207\u7dfc\U00026177\u540c\u6b78'


In [5]: len(unicodedata.normalize('NFC', s))
Out[5]: 8

Without taking drastic steps like recompiling Python for UTF-32, is there a simple way to get the correct length in situations like this?

I'm on IPython 0.13, Python 2.7.2, Mac OS 10.8.2.
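For reference, the behavior the question describes can be reproduced directly by writing the two surrogates by hand (a sketch; \ud858\udd77 is the UTF-16 surrogate-pair encoding of U+26177, which is how a narrow build stores that character internally):

```python
# U+26177 written as the surrogate pair a narrow build stores it as.
# \ud858\udd77 is the UTF-16 encoding of U+26177.
pair = u'\ud858\udd77'
print(len(pair))  # 2, even though this is one character
```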

3
The discussions here and here seem relevant. – DSM
@DSM: Thanks for digging these up. Your first link shows Python compiled for UTF-32 ("wide build"), something I ruled out in my question. In the second, the reply by wberry shows an elaborate piece of code to actually count true characters. My default workaround is like the latter, but I am hoping there exists something built in and more direct. – brannerchinese
I can't reproduce your result here (Ubuntu box, Python 2.7.2). For the unicode u'\u83ef\u889e\u8207\u7dfc\U00026177\u540c\u6b78' I get a length of seven with both len(s) and len(unicodedata.normalize('NFC', s)). – Vicent
It's probably highly version-dependent. Python 3.3 should deal more gracefully with this, since, by default, it never creates surrogate pairs (even though you can create them by hand). – Bakuriu
It isn't UTF-8 that represents the non-BMP character by a surrogate pair. It is UTF-16, or rather the hack that Python used in versions < 3.3 on narrow builds. (Well, you could take the surrogate pairs as in UTF-16, and encode each of the two surrogates using UTF-8, but this is explicitly prohibited by RFC 3629, though many UTF-8 implementations do it: it's called WTF-8. But the only way a string can get encoded in UTF-8 this way is if it originally came from UTF-16.) See chrispy's answer below for a simple solution. – ShreevatsaR
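The RFC 3629 prohibition the comment mentions is visible in practice: Python 3's strict UTF-8 codec refuses to encode a lone surrogate, whereas Python 2's narrow-build codec was lenient (a small sketch, shown under Python 3 semantics for illustration):

```python
# Python 3's strict UTF-8 codec rejects lone surrogates (RFC 3629);
# Python 2's narrow-build codec was lenient and would encode them.
try:
    u'\ud858'.encode('utf-8')
    rejected = False
except UnicodeEncodeError:
    rejected = True
print(rejected)
```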

3 Answers

8
votes
7
votes

I made a function to do this on Python 2:

import re

SURROGATE_PAIR = re.compile(u'[\ud800-\udbff][\udc00-\udfff]', re.UNICODE)

def unicodeLen(s):
    # Each surrogate pair is collapsed to one placeholder before counting.
    return len(SURROGATE_PAIR.sub(u'.', s))

By replacing surrogate pairs with a single character, we 'fix' the len function. On normal strings, this should be pretty efficient: since the pattern won't match, the original string will be returned without modification. It should work on wide (32-bit) Python builds, too, as the surrogate pair encoding will not be used.
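For example, applied to the question's string with U+26177 written out as its surrogate pair \ud858\udd77 (a self-contained sketch; it also runs on Python 3, where lone surrogates can be written literally):

```python
import re

# A high surrogate followed by a low surrogate: one astral character
# as stored on a narrow build.
SURROGATE_PAIR = re.compile(u'[\ud800-\udbff][\udc00-\udfff]', re.UNICODE)

def unicodeLen(s):
    # Collapse each surrogate pair to a single placeholder, then count.
    return len(SURROGATE_PAIR.sub(u'.', s))

# \ud858\udd77 is the surrogate-pair form of U+26177 from the question.
s = u'\u83ef\u889e\u8207\u7dfc\ud858\udd77\u540c\u6b78'
print(len(s))         # 8: the pair counts twice
print(unicodeLen(s))  # 7: one per code point
```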

3
votes

You can override the len function in Python (see: How does len work?) and add an if statement to it that checks for unicode strings containing surrogate pairs.
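A minimal sketch of that suggestion (all names here are illustrative, and shadowing a builtin like this is generally discouraged; the version check lets the same code run on Python 2 and 3):

```python
import re
import sys

# A high surrogate followed by a low surrogate (narrow-build encoding).
_PAIR = re.compile(u'[\ud800-\udbff][\udc00-\udfff]', re.UNICODE)
_text = unicode if sys.version_info[0] == 2 else str
_builtin_len = len

def len(obj):  # shadows the builtin in this module only
    # Count each surrogate pair once for text; defer to the builtin otherwise.
    if isinstance(obj, _text):
        return _builtin_len(_PAIR.sub(u'.', obj))
    return _builtin_len(obj)
```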