Wrong encoding with Python BeautifulSoup + MySql

Question

I'm working with the BeautifulSoup Python library. I used urllib2 to download the HTML from a page and then parsed it with BeautifulSoup. I want to save some of the HTML content into a MySQL table, but I'm having problems with the encoding. The MySQL table uses the 'utf-8' charset.

Some examples:

When I download the HTML and parse it with BeautifulSoup, I get something like:

"Ver las \xc3\xbaltimas noticias. Ent\xc3\xa9rate de las noticias de \xc3\xbaltima hora con la mejor cobertura con fotos y videos"

The correct text would be:

"Ver las últimas noticias. Entérate de las noticias de última hora con la mejor cobertura con fotos y videos"

I have tried encoding and decoding that text with multiple charsets, but when I insert it into MySQL I get something like:

"Ver las Ãºltimas noticias y todos los titulares de hoy en Yahoo! Noticias Argentina. EntÃ©rate de las noticias de Ãºltima hora con la mejor cobertura con fotos y videos"

I'm having problems with the encoding, but I don't know how to solve them.

Any suggestion?

Mu Mind · Accepted Answer · 2011-05-05T19:41:36

You have correct UTF-8 data coming out of BeautifulSoup, but it's stored in a plain byte string (str), not Python's native unicode string type. I think this is what you need to do:

import codecs
codecs.decode(your_string, 'utf-8')

After that, the string should be the proper data type and encoding to send to MySQL.

An example:

>>> import codecs
>>> codecs.decode("Ver las \xc3\xbaltimas noticias. Ent\xc3\xa9rate de las noticias de \xc3\xbaltima hora con la mejor cobertura con fotos y videos", 'utf-8')
u'Ver las \xfaltimas noticias. Ent\xe9rate de las noticias de \xfaltima hora con la mejor cobertura con fotos y videos'
>>> print _
Ver las últimas noticias. Entérate de las noticias de última hora con la mejor cobertura con fotos y videos
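
To see where the garbled text in the question may come from, here is a minimal sketch. It assumes the UTF-8 bytes were interpreted as Latin-1 at some point on the way into the table; the question does not confirm where that happens, and the variable names (raw, text, garbled) are only illustrative. The snippet uses a shortened version of the sample string and is written to run on both Python 2.6+ and Python 3.

```python
# Raw UTF-8 bytes, shortened from the string in the question.
# The b"" prefix is accepted by Python 2.6+ and Python 3.
raw = b"Ver las \xc3\xbaltimas noticias. Ent\xc3\xa9rate"

# Decoding with the correct codec gives the intended text.
text = raw.decode("utf-8")

# Assumption: decoding the same bytes as Latin-1 turns every byte into
# its own character, which produces the two-character garbage per accent.
garbled = raw.decode("latin-1")

# Encoding the garbled text back to Latin-1 restores the original bytes,
# so a correct UTF-8 decode is still possible afterwards.
recovered = garbled.encode("latin-1").decode("utf-8")

print(repr(text))
print(repr(garbled))
print(recovered == text)
```

If what ends up in MySQL looks like garbled here, that suggests the bytes were read with the wrong charset at some step rather than corrupted by BeautifulSoup; which step is responsible is something to check, not something this sketch proves.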
