3
votes

I have an SQLite db that has some crazy ASCII characters in it and I would like to remove them, but I have no idea how to go about it. I googled around and found people suggesting REGEXP with MySQL, but that threw an error saying REGEXP wasn't recognized.

Here is the error I get:

sqlalchemy.exc.OperationalError: (OperationalError) Could not decode to UTF-8 column 'table_name' with text ...

Thanks for the help

2
Are you sure you want to get rid of "crazy" characters? Learning how to deal with all unicode characters is actually kind of fun... - unutbu
So are they ASCII characters or UTF-8? Since you are using SQLAlchemy, it is already handling UTF-8 just fine, but you are probably confused about what to do with it once you get it. docs.python.org/howto/unicode.html - msw
~unutbu: depends on your definition of "fun" ;) I'd certainly call it "useful" and "initially daunting", and "worthwhile" but "fun" never crossed my mind. - msw
I would love to learn how, and I hate to be douchy but I am under a tight deadline right now. Any help or advice would be much appreciated. - imns
You need to show more code. Don't tell us it threw an error, show us. - msw

2 Answers

1
votes

Well, if you really want to shoehorn a rich unicode string into a plain ascii string (and don't mind some goofs), you could use this:

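import unicodedata as ud

def shoehorn_unicode_into_ascii(s):
    # NFKD splits accented characters into a base letter plus combining marks;
    # encoding with 'ignore' then silently drops everything that is not ASCII.
    # This removes accents, but also other things, like ß‘’“”
    # Note: the result is a byte string (bytes on Python 3).
    return ud.normalize('NFKD', s).encode('ascii','ignore')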
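
To make the "goofs" concrete, here is a quick check of what that function keeps and what it drops. This is only a sketch, assuming Python 3; the extra .decode('ascii') and the sample strings are my additions, not part of the original snippet.

```python
import unicodedata as ud

def shoehorn_unicode_into_ascii(s):
    # Same approach as above, plus a final decode so Python 3 returns str
    # instead of bytes (an assumption about what the caller wants).
    return ud.normalize('NFKD', s).encode('ascii', 'ignore').decode('ascii')

# Illustrative inputs: accents, a sharp s, and curly quotes
samples = ['café', 'Straße', '“quoted”', 'naïve résumé']
results = [shoehorn_unicode_into_ascii(s) for s in samples]

for before, after in zip(samples, results):
    print(repr(before), '->', repr(after))
# 'café' -> 'cafe'
# 'Straße' -> 'Strae'
# '“quoted”' -> 'quoted'
# 'naïve résumé' -> 'naive resume'
```

Accented letters survive as their base letter, but ß and the curly quotes have no ASCII decomposition, so they vanish entirely.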

For a more complete solution (with somewhat fewer goofs, but requiring a third-party module unidecode), see this answer.

Really, though, the best solution is to work with unicode data throughout your code as much as possible, and drop to an encoding only when necessary.
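
As a rough illustration of "decode once at the boundary", here is a sketch using the standard-library sqlite3 module directly rather than SQLAlchemy. The table and column names are made up, and I'm assuming the underlying problem is bytes in a TEXT column that are not valid UTF-8, which is what the "Could not decode to UTF-8" message suggests. How you would wire this into an SQLAlchemy engine is not shown here.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE items (name TEXT)')  # hypothetical table

# Store Latin-1 bytes as TEXT; b'\xe9' on its own is not valid UTF-8.
conn.execute('INSERT INTO items VALUES (CAST(? AS TEXT))', (b'caf\xe9',))

# With the default text_factory (str), reading the row back fails.
default_failed = False
try:
    conn.execute('SELECT name FROM items').fetchall()
except sqlite3.OperationalError as exc:
    default_failed = True
    print('default decoding failed:', exc)

# Decode at the boundary instead: undecodable bytes become U+FFFD
# rather than raising, and the rest of the code only ever sees str.
conn.text_factory = lambda raw: raw.decode('utf-8', errors='replace')
rows = conn.execute('SELECT name FROM items').fetchall()
print(rows)
```

If you know the real encoding of the stored bytes (Latin-1, say), decoding with that encoding in the text_factory would recover the characters instead of replacing them.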

0
votes

django.utils.encoding has a great set of robust unicode encoding and decoding functions.