I wrote code that connects to an IMAP server, parses the message body, and inserts it into a database, but I am having problems with accented characters.

From the email header I got this information:

Content-Type: text/html; charset=ISO-8859-1

But I am not sure whether I can trust this information...

The email was written in Portuguese, so it contains many accented words. For example, I extracted the following phrase from the email source (viewed in my browser):

"...instalação de eletrônicos..."

So I connected to the IMAP server and fetched some emails:

... typ, data = M.fetch(num, '(RFC822)') ...

When I print the content, I get the following word:

print data[0][1]
instala+º+úo de eletr+¦nicos

I tried to use .decode('utf-8') but I had no success.

instalação de eletrônicos

How can I make it human-readable? My database uses UTF-8.

What does print(type(data[0][1])); print(repr(data[0][1])) print? - Martijn Pieters
@WinstonEwert - Python 2.7 - Thomas
@MartijnPieters - type: <type 'str'> and "print(repr(" returned accents with the following format: fun\xc3\xa7\xc3\xa3o (sorry, this is another accented word) - Thomas
No, that's exactly what I wanted to see. That's função in UTF8. And .decode('utf8') should work, perhaps you need to show us more code? - Martijn Pieters
@MartijnPieters, I tried: print repr(data[0][1]).decode('utf8') but still showing "fun\xc3\xa7\xc3\xa3o" (you are right about decoded word "função") - Thomas

3 Answers

0
votes

The header declares the ISO-8859-1 charset, so you need to decode the string with that encoding.

Try this:

data[0][1].decode('iso-8859-1')
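
Whether this gives readable text depends on the bytes really being ISO-8859-1. A minimal sketch for checking, using the sample bytes reported in the comments on the question (the variable names are illustrative, and the byte-string literal stands in for data[0][1]):

```python
# Sample bytes taken from the comment thread; they stand in for data[0][1].
raw = b"fun\xc3\xa7\xc3\xa3o"

# Decoding with the charset the bytes were actually written in gives 6 characters.
as_utf8 = raw.decode("utf-8")

# ISO-8859-1 maps every byte to exactly one character, so decoding never fails,
# but UTF-8 input comes out as 8 characters of mojibake.
as_latin1 = raw.decode("iso-8859-1")

print("%d characters as UTF-8, %d as ISO-8859-1" % (len(as_utf8), len(as_latin1)))
```

If the ISO-8859-1 result is longer than expected and contains sequences such as Ã§, the bytes were most likely UTF-8 and the header charset does not describe that part of the message.
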
0
votes

Specifying the source code encoding worked for me. It is the coding comment at the top of the example below, and it must be placed on the first or second line of your Python file.

#!/usr/bin/python
# -*- coding: iso-8859-15 -*-

value = """...instalação de eletrônicos...""".decode("iso-8859-15")
print value
# prints: ...instalação de eletrônicos...

import unicodedata
value = unicodedata.normalize('NFKD', value).encode('ascii','ignore')
print value
# prints: ...instalacao de eletronicos...

And now you can call str(value) without an exception as well, because only ASCII characters remain.

See: http://docs.python.org/2/library/unicodedata.html

This seems to keep all accents:

#!/usr/bin/python
# -*- coding: iso-8859-15 -*-
import unicodedata
value = """...instalação de eletrônicos...""".decode("iso-8859-15")
value = unicodedata.normalize('NFKC', value).encode('utf-8')
print value
print str(value)

# prints (without exceptions/errors):
# ...instalação de eletrônicos...
# ...instalação de eletrônicos...

EDIT:

Note that with the last version, even though the output looks the same, comparing the two values with == does not return True. For example:

#!/usr/bin/python
# -*- coding: iso-8859-15 -*-
import unicodedata
inValue = """...instalação de eletrônicos...""".decode("iso-8859-15")
normalizedValue = unicodedata.normalize('NFKC', inValue).encode('utf-8')

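# inValue is a unicode object and normalizedValue is a UTF-8 byte string,
# so Python 2 emits a UnicodeWarning and treats them as unequal.
print inValue == normalizedValue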
# False
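
The comparison is between decoded text and UTF-8 encoded bytes. A small sketch, assuming the goal is only to check that no characters were lost, decodes the bytes again so that both sides are text (the names text, encoded and decoded_again are illustrative):

```python
import unicodedata

# The accented word from the question, written with escapes so the file
# needs no coding declaration.
text = u"instala\xe7\xe3o"

# Same steps as in the answer: normalize, then encode to UTF-8 bytes.
encoded = unicodedata.normalize("NFKC", text).encode("utf-8")

# Decode the bytes again so that text is compared with text.
decoded_again = encoded.decode("utf-8")
print(decoded_again == text)
```
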

EDIT2:

This returns the same:

normalizedValue = unicode("""...instalação de eletrônicos...""".decode("iso-8859-15")).encode('utf-8')
print normalizedValue 
print str(normalizedValue )

# prints (without exceptions/errors):
# ...instalação de eletrônicos...
# ...instalação de eletrônicos...

Though I'm not sure this will actually be valid for a utf-8 encoded database. Probably not?
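
One way to check is to compare the result with the UTF-8 bytes you expect for the phrase. This is only a sketch: the byte strings are written out by hand, and it assumes the input really is ISO-8859-1 and that the database driver accepts UTF-8 encoded byte strings.

```python
# The phrase from the question as ISO-8859-1 bytes (one byte per accented letter).
latin1_bytes = b"instala\xe7\xe3o de eletr\xf4nicos"

# Decode to text first, then encode to the charset the database uses.
text = latin1_bytes.decode("iso-8859-1")
utf8_bytes = text.encode("utf-8")

# In UTF-8 each of these accented letters takes two bytes.
expected = b"instala\xc3\xa7\xc3\xa3o de eletr\xc3\xb4nicos"
print(utf8_bytes == expected)
```
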

0
votes

Thanks to Martijn Pieters. We figured out that the email contained parts with two different encodings. I had to split the parts and decode each one individually.
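
A rough sketch of that approach using the standard email module. It is not the exact code I used: the function name decode_text_parts, the hand-written sample message RAW, and the UTF-8 fallback for parts without a declared charset are all assumptions made for illustration.

```python
import email

# Hand-written stand-in for data[0][1]: two parts, each with its own charset.
RAW = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/alternative; boundary="BOUND"\r\n'
    b"\r\n"
    b"--BOUND\r\n"
    b"Content-Type: text/plain; charset=UTF-8\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"instala\xc3\xa7\xc3\xa3o\r\n"
    b"--BOUND\r\n"
    b"Content-Type: text/html; charset=ISO-8859-1\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"<p>eletr\xf4nicos</p>\r\n"
    b"--BOUND--\r\n"
)


def decode_text_parts(raw_message):
    """Return (content_type, charset, text) for every text part of the message."""
    # Python 3 has message_from_bytes; Python 2 parses byte strings
    # with message_from_string.
    parse = getattr(email, "message_from_bytes", email.message_from_string)
    msg = parse(raw_message)
    parts = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip the multipart container and any attachments
        payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        charset = part.get_content_charset() or "utf-8"  # fallback is an assumption
        parts.append((part.get_content_type(), charset, payload.decode(charset, "replace")))
    return parts


parts = decode_text_parts(RAW)
for content_type, charset, text in parts:
    print("%s decoded as %s" % (content_type, charset))
```

Each decoded text can then be encoded with .encode('utf-8') before it is inserted into the database.
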