String encoding and decoding from possibly latin1 and utf8

Question

I recently stumbled upon a MySQL database that was encoded as Latin1 and rendered question marks when viewed in a browser. To fix this, we changed the encoding of the DB to utf8 and the collation to utf8_general_ci on all of our tables, but the data already stored still showed up with question marks. All storing and retrieving of data between MySQL and the browser is done by PHP, so I made sure utf8 was used in PHP as well, and even ran SET NAMES utf8 as many people suggested online. The problem is that I have now ended up with weird characters such as ÃƒÂ‘ in strings that we know didn't have them.

Examples of data

Stored:

EMMANUEL PE\xc3\u0192\xc2\u2018A GOMEZ PORTUGAL

Rendered:

EMMANUEL PEÃƒÂ‘A GOMEZ PORTUGAL

Proper:

EMMANUEL PEÑA GOMEZ PORTUGAL

Stored:

Luis Hern\xe1ndez-Higareda

Rendered:

Luis Hernández-Higareda

Proper:

Luis Hernández-Higareda

Stored:

Teresa de Jes\xc3\u0192\xc2\xbas Galicia G\xc3\u0192\xc2\xb3mez

Rendered:

Teresa de JesÃƒÂºs Galicia GÃƒÂ³mez

Proper:

Teresa de Jesús Galicia Gómez

Stored:

DR. JOS\xc3\u0192\xc2\u2030 ABEN\xc3\u0192\xc2\x81MAR RIC\xc3\u0192\xc2\x81RDEZ GARC\xc3\u0192\xc2\x8dA

Rendered:

DR. JOSÃƒÂ‰ ABENÃƒÂMAR RICÃƒÂRDEZ GARCÃƒÂA

Currently I'm using Python to get the data from the DB. I'm trying to normalize it to Unicode (UTF-8), but I'm really lost, and this is as far as I've gotten. I need to convert what currently shows up as weird characters into readable text, as shown above.

What am I missing here? Is the data unrepairable?

Functions https://gist.github.com/2649463

Note: of all the examples, one already renders properly (I left it in so that it is taken into account in any advice on how to fix this).

Thanasis Petsas · Accepted Answer · 2012-05-09T23:10:24

Try this (where s is your unicode string; I renamed it from str to avoid shadowing the built-in):

print s.encode('cp1252').decode('utf-8').encode('cp1252').decode('utf-8')

An example using IPython:

In [49]: a=u'Teresa de Jes\xc3\u0192\xc2\xbas Galicia G\xc3\u0192\xc2\xb3mez'

In [50]: a=u'Teresa de Jes\xc3\u0192\xc2\xbas Galicia G\xc3\u0192\xc2\xb3mez'

In [51]: print a
Teresa de JesÃƒÂºs Galicia GÃƒÂ³mez

In [52]: print a.encode('cp1252').decode('utf-8').encode('cp1252').decode('utf-8')
Teresa de Jesús Galicia Gómez

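This is "mis-encoded" UTF-8: the text was UTF-8, but it was read as cp1252 and re-encoded as UTF-8 twice, so undoing both layers restores it.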
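If you are on Python 3, or if the one-liner raises on some rows, a more forgiving variant might look like the sketch below. Python's cp1252 codec leaves a few bytes undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) and cannot encode C1 control characters such as \x81 or \x91, so strings like the ABENÁMAR and PEÑA examples can fail with UnicodeEncodeError. The sketch assumes the damage is UTF-8 that was read as cp1252, with a Latin-1 pass-through for those characters, one or more times. The names `repair` and `_to_bytes` are made up for this example and are not part of any library.

```python
def _to_bytes(text):
    """Map each character back to the single byte it was probably read from."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            # Assumption: characters cp1252 cannot encode (C1 controls such as
            # \x81 or \x91) went through a Latin-1 style pass-through.
            # This raises UnicodeEncodeError for anything above U+00FF.
            out += ch.encode("latin-1")
    return bytes(out)


def repair(text, max_rounds=3):
    """Undo up to max_rounds layers of UTF-8-read-as-cp1252 mojibake."""
    for _ in range(max_rounds):
        try:
            fixed = _to_bytes(text).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # no valid UTF-8 underneath, so treat the text as already fine
        if fixed == text:
            break  # pure ASCII, nothing left to undo
        text = fixed
    return text


# The four strings from the question, as they are stored.
samples = [
    "EMMANUEL PE\xc3\u0192\xc2\u2018A GOMEZ PORTUGAL",
    "Luis Hern\xe1ndez-Higareda",
    "Teresa de Jes\xc3\u0192\xc2\xbas Galicia G\xc3\u0192\xc2\xb3mez",
    "DR. JOS\xc3\u0192\xc2\u2030 ABEN\xc3\u0192\xc2\x81MAR "
    "RIC\xc3\u0192\xc2\x81RDEZ GARC\xc3\u0192\xc2\x8dA",
]

for s in samples:
    print(repair(s))
```

The loop stops as soon as a round fails to decode, which is why the already-correct "Luis Hernández-Higareda" row comes back unchanged. This is a heuristic: a correctly stored string that happens to look like valid UTF-8 after the byte mapping would also be rewritten, so I would run it on a copy of the table and spot-check the results first.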
