2
votes

I'm working with the BeautifulSoup Python library. I used urllib2 to download the HTML from a page and then parsed it with BeautifulSoup. I want to save some of the HTML content into a MySQL table, but I'm having problems with the encoding. The MySQL table uses the 'utf-8' charset.

Some examples:

When I download the HTML code and parse it with BeautifulSoup I have something like:

"Ver las \xc3\xbaltimas noticias. Ent\xc3\xa9rate de las noticias de \xc3\xbaltima hora con la mejor cobertura con fotos y videos"

The correct text would be:

"Ver las últimas noticias. Entérate de las noticias de última hora con la mejor cobertura con fotos y videos"

I have tried to encode and decode that text with multiple charsets, but when I insert it into MySQL I get something like:

"Ver las últimas noticias y todos los titulares de hoy en Yahoo! Noticias Argentina. Entérate de las noticias de última hora con la mejor cobertura con fotos y videos"

I'm having problems with the encoding, but I don't know how to solve them.

Any suggestions?


2 Answers

3
votes

You have correct UTF-8 data coming out of BeautifulSoup, but it's being stored in a plain byte string (str), not Python's native unicode string type. I think this is what you need to do:

codecs.decode(your_string, 'utf-8')

And then the string should be the proper data type and encoding to send to MySQL.

An example:

>>> import codecs
>>> codecs.decode("Ver las \xc3\xbaltimas noticias. Ent\xc3\xa9rate de las noticias de \xc3\xbaltima hora con la mejor cobertura con fotos y videos", 'utf-8')
u'Ver las \xfaltimas noticias. Ent\xe9rate de las noticias de \xfaltima hora con la mejor cobertura con fotos y videos'
>>> print _
Ver las últimas noticias. Entérate de las noticias de última hora con la mejor cobertura con fotos y videos
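
The same idea can be written as a small script rather than an interactive session. This is only a sketch: the sample text is a shortened copy of the string from the question, and the names `raw`, `text`, `same`, `wrong` and `round_trip` are made up for illustration. It also shows what decoding the same bytes with the wrong charset does (Latin-1 is used purely as an example; nothing in the question says which charset was involved).

```python
import codecs

# Raw UTF-8 bytes, shortened from the string in the question.
raw = b"Ver las \xc3\xbaltimas noticias. Ent\xc3\xa9rate"

# Decode once to get a unicode (text) string.
text = codecs.decode(raw, "utf-8")
same = raw.decode("utf-8")  # equivalent spelling of the same decode

# Decoding the same bytes with the wrong charset (Latin-1 here, as an
# illustrative assumption) does not fail, it silently yields garbled text.
wrong = raw.decode("latin-1")

# Encoding the text back to UTF-8 returns the original bytes.
round_trip = text.encode("utf-8")
```

If `wrong` looks like what ends up in the table, the bytes are probably being decoded or re-encoded with a mismatched charset somewhere between the script and the database.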
2
votes

BeautifulSoup returns all data as unicode strings. First, triple-check that the unicode strings are correct. If they are not, then there is an issue with the encoding of the input data.
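
One way to do that check, as a rough sketch: the helper name `to_unicode` is hypothetical (it is not part of BeautifulSoup), and it assumes that any byte string it receives is UTF-8 encoded. It leaves values that are already unicode untouched and decodes only byte strings, so text is never decoded twice.

```python
def to_unicode(value, encoding="utf-8"):
    """Return value as a text string, decoding it only if it is bytes.

    Hypothetical helper; the default encoding is an assumption about
    the input data.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# A byte string is decoded; a unicode string passes through unchanged.
from_bytes = to_unicode(b"\xc3\xbaltima hora")
from_text = to_unicode(u"\xfaltima hora")
```

Comparing `repr()` of the value before and after the call is a quick way to see whether it was bytes or unicode to begin with.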