Handling non-ASCII characters in Python is really confusing. Can anyone explain?
I'm trying to read a plain text file and replace all non-alphabetic characters with spaces.
I have a list of characters:
ignorelist = ('!', '-', '_', '(', ')', ',', '.', ':', ';', '"', '\'', '?', '#', '@', '$', '^', '&', '*', '+', '=', '{', '}', '[', ']', '\\', '|', '<', '>', '/', u'—')
For each token I get, I replace every character from that list with a space by calling:
for punc in ignorelist:
    token = token.replace(punc, ' ')
Note that there is a non-ASCII character at the end of ignorelist: u'—'
Every time my code encounters that character, it crashes with:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position
I tried declaring the encoding by adding # -*- coding: utf-8 -*-
at the top of the file, but it still doesn't work. Does anyone know why? Thanks!
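
For reference, here is a minimal sketch of what is probably happening and one way around it. It assumes Python 2 and that each token is a UTF-8 encoded byte string read from the file; neither is stated above. In that case, calling token.replace() with a unicode argument makes Python decode the byte string with the default 'ascii' codec first, and that fails on byte 0xe2, the first byte of the UTF-8 encoding of the dash. The coding declaration only tells the interpreter how to read literals in the source file, so it does not change that implicit conversion. The names IGNORELIST and replace_punctuation are made up for this illustration.

```python
# -*- coding: utf-8 -*-

# Every entry is a unicode character, so no byte/unicode mixing can occur.
# \u2014 is the long dash from the original list, written as an escape.
IGNORELIST = tuple(u'!-_(),.:;"\'?#@$^&*+={}[]\\|<>/\u2014')


def replace_punctuation(raw_token, encoding="utf-8"):
    """Decode a byte-string token, then replace each ignored character with a space.

    Assumption: raw_token holds bytes in the given encoding (UTF-8 by default).
    """
    text = raw_token.decode(encoding)  # explicit decode instead of the implicit ASCII one
    for punc in IGNORELIST:
        text = text.replace(punc, u' ')
    return text
```

With this, replace_punctuation(b"foo\xe2\x80\x94bar!") gives u"foo bar ", because the three bytes e2 80 94 are decoded into a single dash before the replacement runs. Decoding once when the file is read (for example with io.open(path, encoding="utf-8")) and keeping everything as unicode from then on should have the same effect.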