Handling non-ASCII strings in Python

Question

Handling non-ASCII characters in Python is really confusing. Can anyone explain?

I'm trying to read a plain text file and replace all non-alphabetic characters with spaces.

I have a list of characters:

ignorelist = ('!', '-', '_', '(', ')', ',', '.', ':', ';', '"', '\'', '?', '#', '@', '$', '^', '&', '*', '+', '=', '{', '}', '[', ']', '\\', '|', '<', '>', '/', u'—')

For each token I get, I replace every character from that list with a space by calling

    for punc in ignorelist:
        token = token.replace(punc, ' ')

Notice that there's a non-ASCII character at the end of ignorelist: u'—'

Every time my code encounters that character, it crashes and says:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position

I tried to declare the encoding by adding # -*- coding: utf-8 -*- at the top of the file, but it still doesn't work. Does anyone know why? Thanks!

lilydjwg · Accepted Answer · 2013-04-01T03:08:03

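You are using Python 2.x, which converts implicitly between unicode objects and plain str objects using the ASCII codec. That conversion fails as soon as the str contains a non-ASCII byte, which is exactly the UnicodeDecodeError you are seeing.

You shouldn't mix unicode and str objects. You can either stick to unicode: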
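To make that failure concrete, here is a minimal sketch that performs the ASCII decode explicitly instead of letting Python 2 do it behind the scenes. The variable names are illustrative only, and the bytes are the UTF-8 encoding of the em dash from the question.

```python
# -*- coding: utf-8 -*-
# UTF-8 bytes of the em dash (U+2014), as they would arrive from a UTF-8 file.
raw = b'\xe2\x80\x94'

# Python 2 attempts the equivalent of this decode implicitly when a str
# meets a unicode object; doing it by hand reproduces the same error.
try:
    raw.decode('ascii')
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Decoding with the codec the bytes were actually written in succeeds.
dash = raw.decode('utf-8')
```

The first byte, 0xe2, is already outside the ASCII range, which is why the error message in the question names that byte.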

ignorelist = (u'!', u'-', u'_', u'(', u')', u',', u'.', u':', u';', u'"', u'\'', u'?', u'#', u'@', u'$', u'^', u'&', u'*', u'+', u'=', u'{', u'}', u'[', u']', u'\\', u'|', u'<', u'>', u'/', u'—')

if not isinstance(token, unicode):
    token = token.decode('utf-8') # assumes you are using UTF-8
for punc in ignorelist:
    token = token.replace(punc, u' ')

or use only plain strs (note the last element):

ignorelist = ('!', '-', '_', '(', ')', ',', '.', ':', ';', '"', '\'', '?', '#', '@', '$', '^', '&', '*', '+', '=', '{', '}', '[', ']', '\\', '|', '<', '>', '/', u'—'.encode('utf-8'))
# and other parts do not need to change

Because you encode u'—' into a str yourself, Python no longer has to attempt that conversion implicitly.
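As a rough illustration of the byte-string route, the sketch below encodes the dash once and then replaces it inside a byte-string token. The sample token is made up, and UTF-8 input is an assumption.

```python
# -*- coding: utf-8 -*-
dash_text = u'\u2014'                    # the em dash as text
dash_bytes = dash_text.encode('utf-8')   # the same character as three bytes

# A hypothetical token as it would be read from a UTF-8 file in binary form.
token = b'left\xe2\x80\x94right'

# Both operands are byte strings, so no implicit conversion is attempted.
cleaned = token.replace(dash_bytes, b' ')
```

Note that this only matches when the file really is UTF-8; the same character has a different byte sequence in other encodings.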

I suggest you use unicode throughout your program to avoid this kind of error. If that would be too much work, you can use the latter method. Either way, take care when you call functions in the standard library or third-party modules, since they may expect or return the other string type.
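One way to keep everything in unicode is to decode once, at the point where the file is read. The following is only a sketch under assumptions: the helper name clean_tokens and the constant IGNORED are made up for illustration, the file is assumed to be UTF-8, and io.open is used because it returns text in both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Same characters as ignorelist in the question, kept as one text string.
IGNORED = u'!-_(),.:;"\'?#@$^&*+={}[]\\|<>/\u2014'

def clean_tokens(path, encoding='utf-8'):
    """Read a text file, blank out the ignored characters, return the tokens."""
    # io.open decodes while reading, so `text` is unicode from the start.
    with io.open(path, 'r', encoding=encoding) as handle:
        text = handle.read()
    for punc in IGNORED:
        text = text.replace(punc, u' ')
    return text.split()

# Demo with a throwaway file that contains an em dash.
fd, demo_path = tempfile.mkstemp(suffix='.txt')
with io.open(fd, 'w', encoding='utf-8') as handle:
    handle.write(u'hello\u2014world, (again)!')
tokens = clean_tokens(demo_path)
os.remove(demo_path)
```

Because the text is decoded at the boundary, the replacement loop never sees a plain str and no implicit conversion can occur.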

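# -*- coding: utf-8 -*- only tells Python that your source file is encoded in UTF-8 (without it, non-ASCII characters in the source cause a SyntaxError). It does not change how str and unicode objects are converted at runtime.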
