2
votes

I have written a script to parse an email. It works fine with messages sent from the Mac OS X Mail client (the only one tested so far), but my parser fails when a message contains Unicode characters in its body.

For example, I sent a message with the content ąčę.

Here is the part of my script that parses the body and the attachments at the same time:

from email.feedparser import FeedParser

from django.core.files.base import ContentFile

# msg initially holds the raw message text
p = FeedParser()
p.feed(msg)
msg = p.close()
attachments = []
body = None
for part in msg.walk():
    # multipart containers carry no payload of their own
    if part.get_content_type().startswith('multipart/'):
        continue
    try:
        filename = part.get_filename()
    except Exception:
        # unicode letters in filename, set default name then
        filename = 'Mail attachment'

    if part.get_content_type() == "text/plain" and not body:
        # raw bytes in the part's charset, not yet decoded to unicode
        body = part.get_payload(decode=True)
    elif filename is not None:
        attachments.append(ContentFile(part.get_payload(decode=True), filename))

if body is None:
    body = ''

As mentioned, this works with messages from OS X Mail, but it fails with messages sent from Gmail.

The traceback:

Traceback (most recent call last):
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/site-packages/django/core/handlers/base.py", line 116, in get_response
    response = callback(request, *callback_args, **callback_kwargs)
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/site-packages/django/views/decorators/csrf.py", line 77, in wrapped_view
    return view_func(*args, **kwargs)
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/site-packages/django/views/decorators/http.py", line 41, in inner
    return func(request, *args, **kwargs)
  File "/Users/aemdy/PycharmProjects/rezervavau/bms/messages/views.py", line 66, in accept
    Message.accept(request.POST.get('msg'))
  File "/Users/aemdy/PycharmProjects/rezervavau/bms/messages/models.py", line 261, in accept
    thread=thread
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/site-packages/django/db/models/manager.py", line 149, in create
    return self.get_query_set().create(**kwargs)
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/site-packages/django/db/models/query.py", line 391, in create
    obj.save(force_insert=True, using=self.db)
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/site-packages/django/db/models/base.py", line 532, in save
    force_update=force_update, update_fields=update_fields)
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/site-packages/django/db/models/base.py", line 627, in save_base
    result = manager._insert([self], fields=fields, return_id=update_pk, using=using, raw=raw)
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/site-packages/django/db/models/manager.py", line 215, in _insert
    return insert_query(self.model, objs, fields, **kwargs)
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/site-packages/django/db/models/query.py", line 1633, in insert_query
    return query.get_compiler(using=using).execute_sql(return_id)
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 920, in execute_sql
    cursor.execute(sql, params)
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/site-packages/django/db/backends/util.py", line 47, in execute
    sql = self.db.ops.last_executed_query(self.cursor, sql, params)
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/site-packages/django/db/backends/postgresql_psycopg2/operations.py", line 201, in last_executed_query
    return cursor.query.decode('utf-8')
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 115: invalid continuation byte

My script gives me the following body ����. How can I decode it to get ąčę back?

3
The string with the special characters that you are sending is latin-1 encoded and you're trying to interpret it as being utf-8, which obviously fails. - Ioan Alexandru Cucu
But I need a universal way to parse the body of the email message. How can I achieve that? Decoding as latin-1 results in àèæë. - aemdy
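
The way ą, č and ę turn into à, è and æ in the comment above is what you would see if the body were encoded in a Baltic code page rather than latin-1. A minimal sketch in Python 3, assuming (the traceback does not confirm this) that the sender used windows-1257:

```python
# Assumption: the mail client encoded the body as windows-1257 (Baltic).
raw = "ąčę".encode("windows-1257")

# Decoding the same bytes with the wrong single-byte charset succeeds,
# but produces different letters.
wrong = raw.decode("latin-1")

# UTF-8 rejects these bytes outright, as in the traceback.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Only the charset that was actually used restores the original text.
right = raw.decode("windows-1257")

print(repr(raw), wrong, utf8_ok, right)
```

This is why guessing one fixed encoding cannot work in general: the charset has to come from the message itself.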

3 Answers

6
votes

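Well, I found a solution myself. I will do some testing now and will let you know if anything fails.

get_payload(decode=True) only undoes the transfer encoding, so I needed to decode the body a second time, using the charset declared by the part: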

body = part.get_payload(decode=True).decode(part.get_content_charset())
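
Note that get_content_charset() returns None when the part declares no charset, in which case the .decode() call above would raise a TypeError. A hedged sketch of a fallback, written for Python 3; the helper name decode_part, the utf-8 default and the sample message are assumptions for illustration, not part of the original answer:

```python
import base64
from email import message_from_bytes


def decode_part(part, default="utf-8"):
    """Return the text of a non-multipart MIME part (hypothetical helper)."""
    # Bytes after undoing base64 / quoted-printable transfer encoding.
    payload = part.get_payload(decode=True)
    charset = part.get_content_charset() or default
    try:
        return payload.decode(charset, "replace")
    except LookupError:
        # The declared charset is unknown to Python; use the default instead.
        return payload.decode(default, "replace")


# Made-up sample message with a windows-1257 body, base64 encoded.
raw = (
    b"Content-Type: text/plain; charset=windows-1257\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n" + base64.b64encode("ąčę".encode("windows-1257")) + b"\r\n"
)
text = decode_part(message_from_bytes(raw))

# A part that declares no charset falls back to the default.
plain = decode_part(message_from_bytes(b"Content-Type: text/plain\r\n\r\nhello\r\n"))
```

The "replace" error handler trades exactness for robustness: undecodable bytes become U+FFFD instead of raising an exception.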
1
votes

You might want to try using this:

from email.iterators import typed_subpart_iterator


def get_charset(message, default="ascii"):
    """Get the message charset"""

    if message.get_content_charset():
        return message.get_content_charset()

    if message.get_charset():
        # get_charset() returns a Charset object; str() gives its name
        return str(message.get_charset())

    return default

def get_body(message):
    """Get the body of the email message"""

    if message.is_multipart():
        #get the plain text version only
        text_parts = [part
                      for part in typed_subpart_iterator(message,
                                                         'text',
                                                         'plain')]
        body = []
        for part in text_parts:
            charset = get_charset(part, get_charset(message))
            body.append(unicode(part.get_payload(decode=True),
                                charset,
                                "replace"))

        return u"\n".join(body).strip()

    else: # if it is not multipart, the payload will be a string
          # representing the message body
        body = unicode(message.get_payload(decode=True),
                       get_charset(message),
                       "replace")
        return body.strip()
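
On Python 3.6 and later, the standard library can do most of this by itself when the message is parsed with the default policy. A minimal sketch, assuming the raw message is available as bytes; the sample message is made up for illustration:

```python
import base64
from email import message_from_bytes, policy

# Made-up multipart message whose plain-text part is windows-1257, base64 encoded.
raw = b"".join([
    b"MIME-Version: 1.0\r\n",
    b'Content-Type: multipart/alternative; boundary="b1"\r\n',
    b"\r\n",
    b"--b1\r\n",
    b"Content-Type: text/plain; charset=windows-1257\r\n",
    b"Content-Transfer-Encoding: base64\r\n",
    b"\r\n",
    base64.b64encode("ąčę".encode("windows-1257")) + b"\r\n",
    b"--b1\r\n",
    b"Content-Type: text/html; charset=utf-8\r\n",
    b"\r\n",
    b"<p>placeholder</p>\r\n",
    b"--b1--\r\n",
])

# policy.default makes the parser return EmailMessage objects.
msg = message_from_bytes(raw, policy=policy.default)

# Pick the text/plain body; get_body() returns None if there is none.
part = msg.get_body(preferencelist=("plain",))

# get_content() undoes the transfer encoding and decodes with the declared charset.
body = part.get_content().strip() if part is not None else ""
```

Here get_body() replaces the manual walk over subparts, and get_content() replaces the explicit charset lookup.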
0
votes

You might want to take a look at email.iterators (not sure it will solve your encoding problems though).