Get unicode code point of a character using Python

71

votes

Is there a way in Python to extract the Unicode code point of a single character?

Edit: In case it matters, I'm using Python 2.7.

+1.. Had no idea what unicode code points were before reading this :) - Demian Brecht

E.g. for '\u304f' I want '304f'. Is that what ord() will do? Yes: docs.python.org/library/functions.html#ord - SK9

Yes, ord("\N{HIRAGANA LETTER KU}") is indeed 12367, aka 0x304F. I would never use numbers for characters the way you do, only named ones the way I do. Magic numbers are bad for your program. Just think of chr and ord as inverse functions of each other. It’s really easy. - tchrist

@tchrist it might be worth noting that chr is the inverse of ord in Python 3.x, but in Python 2.x unichr is the inverse of ord, as chr only works for ordinals up to 255 in Python 2.x. - cryo

@tchrist there are still lots of people using Python 2.x. Even in Python 3.x there are still narrow Unicode builds (for example, most Windows builds of Python 3.x are narrow). I wouldn't call most 2.x Unicode issues bugs so much as additions to maintain backwards compatibility with older scripts; Python 2.x usually works fine with Unicode. Python 3.0 does make things much more consistent though, eliminating the difference between str and unicode. - cryo

62

votes

>>> ord(u"ć")
263
>>> u"café"[2]
u'f'
>>> u"café"[3]
u'\xe9'
>>> for c in u"café":
...     print repr(c), ord(c)
... 
u'c' 99
u'a' 97
u'f' 102
u'\xe9' 233
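
The comments on the question ask for the hex form ('304f' for '\u304f'), while ord() returns a plain integer. A minimal sketch of the conversion follows; the helper name code_point_hex is made up for illustration, and the four-digit zero padding is an assumption about the desired format.

```python
def code_point_hex(ch):
    """Return the code point of one character as lowercase hex, padded to 4 digits."""
    # ord() gives the integer code point; format() renders it in hexadecimal.
    return format(ord(ch), '04x')

print(code_point_hex(u'\u304f'))  # 304f
print(code_point_hex(u'\xe9'))    # 00e9
```

The u'' prefix keeps the literals valid on both Python 2.7 and Python 3.3+.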

72

votes

If I understand your question correctly, you can do this.

>>> s='㈲'
>>> s.encode("unicode_escape")
b'\\u3232'

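This returns the Unicode escape sequence for the character, as it would be written in source code. The b'...' output above is from Python 3; in Python 2, start from a unicode literal (u'㈲') instead.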
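
One caveat worth seeing in practice: the shape of the escape depends on the code point, so the result is not always a \uXXXX sequence. The sketch below assumes Python 3, and the sample characters are chosen only for illustration.

```python
# Python 3: str.encode("unicode_escape") returns bytes, and the escape form
# varies with the code point (printable ASCII is left unescaped).
samples = ['a', '\xe9', '\u3232', '\U0001F40D']
escaped = [s.encode('unicode_escape') for s in samples]

for s, e in zip(samples, escaped):
    print(hex(ord(s)), e)
# 0x61 b'a'
# 0xe9 b'\\xe9'
# 0x3232 b'\\u3232'
# 0x1f40d b'\\U0001f40d'
```

If what you need is the number itself, ord() is the more direct route; the escape is mainly useful for display.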

12

votes

Usually, you just do ord(character) to find the code point of a character. For completeness though, characters outside the Basic Multilingual Plane (code points above U+FFFF) are represented as surrogate pairs (i.e. two code units) in narrow Python builds, so in that case I often needed this small workaround:

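def get_wide_ordinal(char):
    """Return the code point of char, which may be a UTF-16 surrogate pair."""
    if len(char) != 2:
        return ord(char)
    # Combine the high (0xD800-based) and low (0xDC00-based) surrogates.
    return 0x10000 + (ord(char[0]) - 0xD800) * 0x400 + (ord(char[1]) - 0xDC00)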

This is rare in most applications though, so normally just use ord().
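
As a worked example of the arithmetic, here is the same function applied to the surrogate pair for U+1F40D. The pair is written out explicitly as two \u escapes; on Python 3.3+ that literal is a two-element string, which imitates what a narrow build stores for this character.

```python
def get_wide_ordinal(char):
    if len(char) != 2:
        return ord(char)
    return 0x10000 + (ord(char[0]) - 0xD800) * 0x400 + (ord(char[1]) - 0xDC00)

# High surrogate 0xD83D and low surrogate 0xDC0D together encode U+1F40D:
# 0x10000 + (0x3D * 0x400) + 0x0D == 0x1F40D
pair = u'\ud83d\udc0d'
print(hex(get_wide_ordinal(pair)))   # 0x1f40d
print(hex(get_wide_ordinal(u'A')))   # 0x41
```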

10

votes

Turns out getting this right is fairly tricky: Python 2 and Python 3 have some subtle issues with extracting Unicode code points from a string.

Up until Python 3.3, it was possible to compile Python in one of two modes:

sys.maxunicode == 0x10FFFF

In this mode, Python's Unicode strings support the full range of Unicode code points from U+0000 to U+10FFFF. One code point is represented by one string element:

>>> import sys
>>> hex(sys.maxunicode)
'0x10ffff'
>>> len(u'\U0001F40D')
1
>>> [c for c in u'\U0001F40D']
[u'\U0001f40d']

This is the default for Python 2.7 on Linux, as well as universally on Python 3.3 and later across all operating systems.

sys.maxunicode == 0xFFFF

In this mode, Python's Unicode strings only support the range of Unicode code points from U+0000 to U+FFFF. Any code points from U+10000 through U+10FFFF are represented using a pair of string elements in the UTF-16 encoding:

>>> import sys
>>> hex(sys.maxunicode)
'0xffff'
>>> len(u'\U0001F40D')
2
>>> [c for c in u'\U0001F40D']
[u'\ud83d', u'\udc0d']

This is the default for Python 2.7 on macOS and Windows.

This runtime difference makes writing Python modules to manipulate Unicode strings as series of codepoints quite inconvenient.

The codepoints module

To solve this, I contributed a new module codepoints to PyPI:

https://pypi.python.org/pypi/codepoints/1.0

This module solves the problem by exposing APIs to convert Unicode strings to and from lists of code points, regardless of the underlying setting for sys.maxunicode:

>>> hex(sys.maxunicode)
'0xffff'
>>> snake = tuple(codepoints.from_unicode(u'\U0001F40D'))
>>> len(snake)
1
>>> snake[0]
128013
>>> hex(snake[0])
'0x1f40d'
>>> codepoints.to_unicode(snake)
u'\U0001f40d'
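
For readers who would rather not add a dependency, the same idea can be sketched by hand. This is not the codepoints module's implementation, only an illustration of the technique; the name iter_code_points is hypothetical. It joins a high surrogate with the low surrogate that follows it and passes everything else through unchanged, so it gives the same answer on narrow and wide builds.

```python
def iter_code_points(text):
    """Yield integer code points, joining UTF-16 surrogate pairs where present."""
    i = 0
    n = len(text)
    while i < n:
        hi = ord(text[i])
        i += 1
        # A high surrogate followed by a low surrogate encodes one code point.
        if 0xD800 <= hi <= 0xDBFF and i < n:
            lo = ord(text[i])
            if 0xDC00 <= lo <= 0xDFFF:
                i += 1
                yield 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
                continue
        # BMP characters, full code points on wide builds, and lone surrogates.
        yield hi

print([hex(cp) for cp in iter_code_points(u'a\U0001F40D')])  # ['0x61', '0x1f40d']
```

Lone surrogates are yielded as-is rather than rejected; whether that is the right policy depends on the application.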

3

votes

In Python 2:

>>> print hex(ord(u'人'))
0x4eba
