Converting string to unicode type in Python

Question

I'm trying this code:

s = "سلام"
'{:b}'.format(int(s.encode('utf-8').encode('hex'), 16))

but this error occurs:

'{:b}'.format(int(s.encode('utf-8').encode('hex'), 16))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd3 in position 0: ordinal not in range(128)

I tried '{:b}'.format(int(s.encode('utf-8').encode('hex'), 16)) but nothing changed.

What should I do?

Please copy and paste the text of a traceback, not a screenshot. — Martijn Pieters
You have a bytestring, not unicode. s is already encoded in whatever codec your terminal uses. — Martijn Pieters
Yes, if I change it to s = u'سلام' everything works, but it's a variable that I receive from the user via a simple input, not a static string. How can I put different strings in u''? — Aidin.T
Input in the terminal is encoded with the sys.stdin.encoding codec. You can use that to decode to Unicode. — Martijn Pieters

georg · Accepted Answer · 2013-10-08T21:27:23

Since you're using Python 2, s = "سلام" is a byte string (in whatever encoding your terminal uses, presumably UTF-8):

>>> s = "سلام"
>>> s
'\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85'

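Byte strings are already "encoded", so encoding them again makes no sense (Python 2 first tries an implicit ASCII decode, and that is what raises the error). You're looking for unicode ("real") strings, whose literals in Python 2 must be prefixed with u: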

>>> s = u"سلام"
>>> s
u'\u0633\u0644\u0627\u0645'
>>> '{:b}'.format(int(s.encode('utf-8').encode('hex'), 16))
'1101100010110011110110011000010011011000101001111101100110000101'
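A minimal sketch of where the UnicodeDecodeError comes from, written so that it should run on both Python 2.7 and Python 3. It makes the implicit ASCII decode explicit; the variable names are illustrative only, and the bytes are the UTF-8 ones shown in the session above.

```python
# UTF-8 bytes of the sample word, as printed in the session above.
raw = b'\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85'

# In Python 2, raw.encode('utf-8') first does the equivalent of this decode.
try:
    raw.decode('ascii')
    error_name = None
except UnicodeDecodeError as exc:
    error_name = type(exc).__name__
    bad_byte = raw[exc.start:exc.end]  # the first byte outside range(128)

# Decoding with the codec the bytes were actually written in succeeds.
text = raw.decode('utf-8')
```

Here the failing byte is 0xd8. The traceback in the question reports 0xd3 instead, which suggests that the asker's terminal was not using UTF-8 (cp1256 is one candidate, but that is a guess).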

If you're getting a byte string from a function such as raw_input, your string is already encoded, so just skip the encode('utf-8') part:

'{:b}'.format(int(s.encode('hex'), 16))

or (if you're going to do anything else with it) convert it to unicode:

s = s.decode('utf8')

This assumes that your input is UTF-8 encoded. If that might not be the case, check sys.stdin.encoding first.
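The two options above can be combined into one sketch: decode the input once with the terminal's codec, then encode it once for the bit string. The helper names to_text and to_bits are hypothetical, not part of any library, and the UTF-8 fallback is an assumption for the case where sys.stdin.encoding is unset. binascii.hexlify stands in for encode('hex') so that the same code should also run on Python 3.

```python
import binascii
import sys


def to_text(raw, encoding=None):
    """Return unicode text; decode only if `raw` is a byte string."""
    if not isinstance(raw, bytes):
        return raw  # already text, nothing to decode
    # sys.stdin.encoding can be None (e.g. piped input), hence the fallback.
    codec = encoding or getattr(sys.stdin, 'encoding', None) or 'utf-8'
    return raw.decode(codec)


def to_bits(text, encoding='utf-8'):
    """Encode text once, then render the bytes as a binary digit string."""
    data = text.encode(encoding)
    return '{:b}'.format(int(binascii.hexlify(data), 16))


# Stand-in for what raw_input() would return on a UTF-8 terminal.
raw = b'\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85'
bits = to_bits(to_text(raw, 'utf-8'))
```

Two caveats with this approach: '{:b}' drops leading zero bits, so ASCII input such as u'A' yields 7 digits rather than 8, and an empty string raises ValueError because int() has no digits to parse.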

i18n stuff is complicated; here are two articles that will help you further:
