518
votes

How do you convert a Unicode string (containing extra characters like £ $, etc.) into a Python string?

10
We need to know what Python version you are using, and what it is that you are calling a Unicode string. Do the following on a short unicode_string that includes the currency symbols that are causing the bother: Python 2.x: print type(unicode_string), repr(unicode_string) Python 3.x: print(type(unicode_string), ascii(unicode_string)) Then edit your question and copy/paste the results of the above print statement. DON'T retype the results. Also look up near the top of your HTML and see if you can find something like this: <meta http-equiv="Content-Type" content="text/html; charset=iso-8859 - John Machin
I doubt that you get Unicode from a web request. You probably get UTF-8 encoded Unicode. - lutz
@lutz: how exactly is "UTF-8 encoded Unicode" not Unicode? - jalf
You should really clarify what you mean by Unicode string and Python string (giving concrete examples would be best, I guess), as it's clear from the comments that there are different interpretations of your question. I wonder why you haven't done this, although it's been over 3.5 years since you asked this question. - Piotr Dobrogost
@jalf: If it is encoded, it is no longer Unicode, e.g., unicode_string = u"I'm unicode string"; bytestring = unicode_string.encode('utf-8'); unicode_again = bytestring.decode('utf-8') - jfs

10 Answers

595
votes

See unicodedata.normalize

import unicodedata
title = u"Klüft skräms inför på fédéral électoral große"
unicodedata.normalize('NFKD', title).encode('ascii', 'ignore')
# Result in Python 2: 'Kluft skrams infor pa federal electoral groe'
# (Python 3 returns the same content as a bytes object, b'Kluft ...')
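
To see why this works, here is a small Python 3 sketch. NFKD splits an accented letter into a base letter plus a combining mark, and encoding to ASCII with 'ignore' then drops everything that is not ASCII. The helper name strip_accents is my own and not part of unicodedata; it shows a gentler variant that removes only the combining marks, so characters with no decomposition (like ß) survive instead of vanishing.

```python
import unicodedata

def strip_accents(text):
    """Remove combining marks after NFKD decomposition (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

title = "Klüft skräms inför på fédéral électoral große"

# Same approach as the answer: decompose, then drop every non-ASCII character.
ascii_bytes = unicodedata.normalize("NFKD", title).encode("ascii", "ignore")
ascii_text = ascii_bytes.decode("ascii")

# Gentler variant: only the accents are removed, ß is kept.
accent_free = strip_accents(title)

print(ascii_text)
print(accent_free)
```

Note that the 'ignore' route silently loses characters such as ß, so check whether that matters for your data.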
331
votes

You can use encode to ASCII if you don't need to translate the non-ASCII characters:

>>> a=u"aaaàçççñññ"
>>> type(a)
<type 'unicode'>
>>> a.encode('ascii','ignore')
'aaa'
>>> a.encode('ascii','replace')
'aaa???????'
>>>
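
For comparison, a quick Python 3 sketch of the other error handlers that str.encode accepts, using the same sample string. In Python 3 the results are bytes objects, and the default handler ('strict') raises instead of dropping anything.

```python
text = "aaaàçççñññ"

ignored = text.encode("ascii", "ignore")              # drop non-ASCII characters
replaced = text.encode("ascii", "replace")            # substitute '?'
xml_refs = text.encode("ascii", "xmlcharrefreplace")  # numeric character references
escaped = text.encode("ascii", "backslashreplace")    # Python-style escapes

# The default handler is 'strict', which raises on the first non-ASCII character.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

for result in (ignored, replaced, xml_refs, escaped):
    print(result)
```

The last two handlers are lossless in the sense that the original characters can still be recovered from the output.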
147
votes
>>> text=u'abcd'
>>> str(text)
'abcd'

This works only if the string contains nothing but ASCII characters; in Python 2, str() raises UnicodeEncodeError for anything else.

117
votes

If you have a Unicode string, and you want to write this to a file, or other serialised form, you must first encode it into a particular representation that can be stored. There are several common Unicode encodings, such as UTF-16 (uses two bytes for most Unicode characters) or UTF-8 (1-4 bytes / codepoint depending on the character), etc. To convert that string into a particular encoding, you can use:

>>> s= u'£10'
>>> s.encode('utf8')
'\xc2\xa310'
>>> s.encode('utf16')
'\xff\xfe\xa3\x001\x000\x00'

This raw string of bytes can be written to a file. However, note that when reading it back, you must know what encoding it is in and decode it using that same encoding.
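
A small Python 3 sketch of that last point: the same bytes decode correctly with the encoding they were written in, and turn into garbage (or raise an error) with a different one. Latin-1 is just an example of a wrong guess here.

```python
s = "£10"

utf8_bytes = s.encode("utf-8")
utf16_bytes = s.encode("utf-16-le")  # explicit byte order, so no BOM is added

# Decoding with the encoding that was used to encode gives the text back.
round_trip = utf8_bytes.decode("utf-8")

# Decoding with the wrong encoding does not fail here, it just yields mojibake.
wrong = utf8_bytes.decode("latin-1")

print(utf8_bytes, utf16_bytes)
print(round_trip, wrong)
```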

When writing to files, you can get rid of this manual encode/decode process by using the codecs module. So, to open a file that encodes all Unicode strings into UTF-8, use:

import codecs
f = codecs.open('path/to/file.txt','w','utf8')
f.write(my_unicode_string)  # Stored on disk as UTF-8

Do note that anything else that is using these files must understand what encoding the file is in if they want to read them. If you are the only one doing the reading/writing this isn't a problem, otherwise make sure that you write in a form understandable by whatever else uses the files.

In Python 3, this form of file access is the default, and the built-in open function will take an encoding parameter and always translate to/from Unicode strings (the default string object in Python 3) for files opened in text mode.
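
A minimal Python 3 sketch of that, assuming a throwaway file in a temporary directory (the file name prices.txt is made up for the example):

```python
import os
import tempfile

text = "£10 and €5"
path = os.path.join(tempfile.mkdtemp(), "prices.txt")  # hypothetical file name

# Text mode with an explicit encoding: str in, encoded bytes on disk.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reading in text mode with the same encoding returns a str again.
with open(path, "r", encoding="utf-8") as f:
    read_back = f.read()

# Binary mode shows the bytes that were actually stored.
with open(path, "rb") as f:
    raw = f.read()

print(read_back)
print(raw)
```

Passing the encoding explicitly is safer than relying on the platform default, which can differ between machines.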

59
votes

Here is an example:

>>> u = u'€€€'
>>> s = u.encode('utf8')
>>> s
'\xe2\x82\xac\xe2\x82\xac\xe2\x82\xac'
7
votes

My file contains a unicode-escaped string:

\"message\": \"\\u0410\\u0432\\u0442\\u043e\\u0437\\u0430\\u0446\\u0438\\u044f .....\",

This worked for me (Python 3):

f = open("56ad62-json.log", encoding="utf-8")
qq = f.readline()

print(qq)
# {"log":\"message\": \"\\u0410\\u0432\\u0442\\u043e\\u0440\\u0438\\u0437\\u0430\\u0446\\u0438\\u044f \\u043f\\u043e\\u043b\\u044c\\u0437\\u043e\\u0432\\u0430\\u0442\\u0435\\u043b\\u044f\"}

# Two passes: the first collapses the doubled backslashes, the second turns \uXXXX into characters.
qq.encode().decode("unicode-escape").encode().decode("unicode-escape")
# '{"log":"message": "Авторизация пользователя"}\n'
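
One caveat, sketched below in Python 3: the unicode-escape codec treats its input bytes as Latin-1, so .encode().decode("unicode-escape") mangles any real non-ASCII characters that are already in the string. A variant that seems to avoid this is to encode with the backslashreplace handler first. The helper name decode_escapes is my own invention.

```python
def decode_escapes(s):
    """Turn literal \\uXXXX sequences into characters (hypothetical helper).

    Real non-ASCII characters are first rewritten as escapes, so the
    unicode-escape decoder only ever sees pure ASCII input.
    """
    return s.encode("ascii", "backslashreplace").decode("unicode-escape")

mixed = "é \\u0410\\u0432"  # one real accented letter plus two literal escapes

naive = mixed.encode().decode("unicode-escape")  # UTF-8 bytes read as Latin-1
safe = decode_escapes(mixed)

print(naive)
print(safe)
```

Both versions also interpret other backslash sequences such as \n, so this only suits input where every backslash really is an escape.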
5
votes

Well, if you're willing and ready to switch to Python 3 (which you may not be, given its backwards incompatibility with some Python 2 code), you don't have to do any converting: all text in Python 3 is represented with Unicode strings, so the u'<text>' prefix is no longer needed. There is also a separate type for strings of bytes, which represent data (which may be an encoded string).

http://docs.python.org/3.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit

(Of course, if you're currently using Python 3, then the problem is likely something to do with how you're attempting to save the text to a file.)
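
A short Python 3 sketch of the text/data split described above: str holds characters, bytes holds encoded data, and the two never mix implicitly.

```python
text = "£ and $"             # str: a sequence of Unicode code points
data = text.encode("utf-8")  # bytes: the encoded representation

char_count = len(text)  # counts characters
byte_count = len(data)  # counts bytes; £ takes two bytes in UTF-8

# Python 3 refuses to concatenate text and data.
try:
    text + data
    mixing_failed = False
except TypeError:
    mixing_failed = True

print(type(text).__name__, char_count)
print(type(data).__name__, byte_count)
```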

4
votes

Here is some example code:

import unicodedata    
raw_text = u"here $%6757 dfgdfg"
convert_text = unicodedata.normalize('NFKD', raw_text).encode('ascii','ignore')
3
votes

There is a library called ftfy that can help with Unicode issues. It has made my life easier.

Example 1

import ftfy
print(ftfy.fix_text('ünicode'))

output -->
ünicode

Example 2 - UTF-8

import ftfy
print(ftfy.fix_text('\xe2\x80\xa2'))

output -->
•

Example 3 - Unicode code point

import ftfy
print(ftfy.fix_text(u'\u2026'))

output -->
…

https://ftfy.readthedocs.io/en/latest/

pip install ftfy

https://pypi.org/project/ftfy/

2
votes

No answer worked in my case, where I had a string variable containing Unicode escape sequences, and none of the encode/decode approaches explained here did the job.

If I run this in a terminal:

echo "no me llama mucho la atenci\u00f3n"

or

python3
>>> print("no me llama mucho la atenci\u00f3n")

The output is correct:

output: no me llama mucho la atención

But working with scripts loading this string variable didn't work.

This is what worked in my case, in case it helps anybody:

import json

# Raw string: the variable holds the literal characters \u00f3, as it does when loaded from a file
string_to_convert = r"no me llama mucho la atenci\u00f3n"
print(json.loads('"%s"' % string_to_convert))
output: no me llama mucho la atención