I'm using Python 2.7 here (which is very relevant). Let's say I have a string containing an em dash, "—". This character can't be represented in ASCII, so when my Django app processes it, it complains. A lot. I want to replace some such characters with ASCII equivalents for string tokenization and for use with a spell-checking API (PyEnchant, which treats non-ASCII apostrophes as misspellings), for example by using the shorter "-" hyphen instead of an em dash. Here's what I'm doing:

s = unicode(s).replace(u'\u2014', '-').replace(u'\u2018', "'").replace(u'\u2019', "'").replace(u'\u201c', '"').replace(u'\u201d', '"')

Unfortunately, this isn't actually replacing any of the Unicode characters, and I'm not sure why. I don't really have time to upgrade to Python 3 right now. Adding from __future__ import unicode_literals at the top of the file, or declaring the source encoding there, does not let me put actual Unicode literals in the code as it should, and I have tried endless tricks with encode() and decode(). Can anyone give me a straightforward, failsafe way to do this in Python 2.7?

You should fix the places where your app "complains" rather than doing this. Python 2.7 and Django are quite capable of dealing with text in encodings other than ASCII. - Daniel Roseman
@DanielRoseman I would agree, but the spell-checking API is treating all occurrences of non-ASCII characters as spelling errors, and I'm trying to work around that. - Andrew Puglionesi

1 Answer

Oh boy... false alarm here! The code in the question actually works; the problem was that I had entered some incorrect character codes. I'm going to leave the question up, since that code is the only thing that seemed to let me complete this particular task in this environment.
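
For anyone landing here with the same problem, below is a minimal sketch of the same replacement written as a lookup table, plus a small helper for checking which code points a string really contains (which is what tripped me up). The names ASCII_EQUIVALENTS, to_ascii_punctuation and non_ascii_names are made up for this illustration, and the UTF-8 default for byte strings is an assumption; adjust it to whatever your input actually uses. It sticks to \u escapes so it does not depend on the source file encoding, and it should run unchanged on Python 2.7 and Python 3.

```python
import unicodedata

# Hypothetical lookup table: typographic character -> ASCII stand-in.
# These are the same five code points used in the question.
ASCII_EQUIVALENTS = {
    u'\u2014': u'-',   # EM DASH
    u'\u2018': u"'",   # LEFT SINGLE QUOTATION MARK
    u'\u2019': u"'",   # RIGHT SINGLE QUOTATION MARK
    u'\u201c': u'"',   # LEFT DOUBLE QUOTATION MARK
    u'\u201d': u'"',   # RIGHT DOUBLE QUOTATION MARK
}


def to_ascii_punctuation(text, encoding='utf-8'):
    """Return text with the characters above swapped for ASCII ones.

    Byte strings are decoded first; the UTF-8 default is an assumption.
    """
    if isinstance(text, bytes):  # on Python 2.7, bytes is an alias of str
        text = text.decode(encoding)
    for fancy, plain in ASCII_EQUIVALENTS.items():
        text = text.replace(fancy, plain)
    return text


def non_ascii_names(text):
    """List the official Unicode names of the non-ASCII characters in text."""
    return [unicodedata.name(ch, 'UNKNOWN') for ch in text if ord(ch) > 127]


sample = u'\u201cdon\u2019t\u201d \u2014 ok'
cleaned = to_ascii_punctuation(sample)
```

Calling non_ascii_names on the raw input before writing the table is a quick way to confirm you are replacing the characters you think you are, rather than guessing at the codes.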