4
votes

There seem to be a lot of posts about doing this in other languages, but I can't seem to figure out how in Python (I'm using 2.7).

To be clear, I would ideally like to keep the string as unicode and just be able to replace certain specific characters.

For instance:

thisToken = u'tandh\u2013bm'
print(thisToken)

prints the word with the en dash (U+2013) in the middle. I would just like to delete the dash, but without using indexing, because I want to be able to do this anywhere I find these specific characters.

I try using replace like you would with any other character:

newToke = thisToken.replace('\u2013','')
print(newToke)

but it just doesn't work. Any help is much appreciated. Seth

2
If you put from __future__ import unicode_literals at the top of your file, all string literals are automatically unicode, which would have helped here. Watch out for surprises when some strings need to be bytes, though; you can use the b prefix for those. - RemcoGerlich

2 Answers

7
votes

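The string you're searching for must also be a Unicode string. In Python 2, \u escapes are only interpreted inside u'...' literals; in a plain '...' literal, '\u2013' is just six ordinary characters (a backslash, then u2013), so replace finds nothing to match. Try: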

newToke = thisToken.replace(u'\u2013','')
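For anyone who wants to try it, here is a minimal sketch of that fix, plus one possible way to drop several characters in one pass with translate. The names strip_chars and UNWANTED are made up for illustration, and the choice of U+2014 as a second character to remove is just an assumption about what "these specific characters" might include.

```python
# Sketch only: strip_chars and UNWANTED are hypothetical names,
# not something from the original question or answer.

this_token = u'tandh\u2013bm'

# Both arguments are unicode literals, so \u2013 is read as one character.
new_token = this_token.replace(u'\u2013', u'')
print(new_token)  # prints: tandhbm

# Map each unwanted code point to None; translate() on a unicode
# string then deletes those characters wherever they occur.
UNWANTED = {0x2013: None, 0x2014: None}  # en dash, em dash (assumed set)


def strip_chars(text, table=UNWANTED):
    """Return the unicode string `text` without the code points in `table`."""
    return text.translate(table)


print(strip_chars(u'a\u2013b\u2014c'))  # prints: abc
```

Note that the dict form of translate only applies to unicode strings in Python 2; byte strings have a different translate signature.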
0
votes

You can see the answer in this post: How to replace unicode characters in string with something else python?

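If you are starting from a byte string rather than a unicode object, decode it to Unicode first (str below stands for your own byte string variable, not the built-in type). Assuming it's UTF-8-encoded: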

str.decode("utf-8")

Call the replace method and be sure to pass it a Unicode string as its first argument:

str.decode("utf-8").replace(u"\u2022", "")

Encode back to UTF-8, if needed:

str.decode("utf-8").replace(u"\u2022", "").encode("utf-8")
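The three steps above can be sketched step by step as follows. This is only an illustration under stated assumptions: the input is assumed to be UTF-8 bytes, the variable names (raw, text, cleaned, result) are invented here, and U+2013 from the question is used instead of the U+2022 from the linked post.

```python
# Sketch only: variable names are hypothetical, and the input is
# assumed to be a UTF-8-encoded byte string.

raw = b'tandh\xe2\x80\x93bm'  # UTF-8 bytes for u'tandh\u2013bm'

# 1. Decode the bytes to a unicode string.
text = raw.decode('utf-8')

# 2. Replace, passing unicode literals so \u2013 is a single character.
cleaned = text.replace(u'\u2013', u'')

# 3. Encode back to UTF-8 bytes, only if bytes are needed downstream.
result = cleaned.encode('utf-8')

print(repr(result))
```

If the value is already a unicode object, as in the question, steps 1 and 3 are unnecessary and only the replace call is needed.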