1
votes

I am trying to extract the body of Gmail emails via the Gmail API, using Python.

I am able to extract the messages using the commands below. However, there seems to be an issue with the encoding of the email text (the original email contains HTML): for some reason, 3D appears before every quote.

Also, within the a href="my_url", random equal signs (=) appear, and at the end of the link there is an &amp that is not in the original HTML of the email.

Any idea how to fix this?

Code I use to extract the email:

from __future__ import print_function
from googleapiclient.discovery import build
from httplib2 import Http
from oauth2client import file, client, tools

from apiclient import errors
import base64

# `service` is the authorized Gmail API client (setup not shown here)
msgs = service.users().messages().list(userId='me', q="no-reply@hello.com", maxResults=1).execute()
for msg in msgs['messages']:
    # fetch each listed message by its own id, in raw (base64url) form
    message = service.users().messages().get(userId='me', id=msg['id'], format='raw').execute()

Per the API documentation for the format parameter: "raw: Returns the full email message data with body content in the raw field as a base64url encoded string; the payload field is not used."

print(base64.urlsafe_b64decode(message['raw'].encode('ASCII')))

td style=3D"padding:20px; color:#45555f; font-family:Tahoma,He= lvetica; font-size:12px; line-height:18px; "

JPk79hd = JFQZEhc6%2BpAiQKF8M85SFbILbNd6IG8%2FEAWwe3VTr2jPzba4BHf%2FEnjMxq66fr228I7OS =

3
After putting the whole document in front of me, it looks like Python marked the end of each line with an equal sign, because it seems to be trying to keep each line to ### characters. Any thoughts on what could cause that? If I can at least get rid of the equal signs at the end of each line, I can accomplish the rest with find-and-replace-with-blank. Thank you in advance. - FlyingZebra1
Looks like the equal signs are related to base64's encoding: stackoverflow.com/questions/6916805/… - FlyingZebra1

3 Answers

3
votes

You should check the Content-Transfer-Encoding header to see if it specifies quoted-printable because that looks like quoted-printable encoded text.

Per RFC 1521, Section 5.1:

The Quoted-Printable encoding is intended to represent data that largely consists of octets that correspond to printable characters in the US-ASCII character set. It encodes the data in such a way that the resulting octets are unlikely to be modified by mail transport. If the data being encoded are mostly US-ASCII text, the encoded form of the data remains largely recognizable by humans. A body which is entirely US-ASCII may also be encoded in Quoted-Printable to ensure the integrity of the data should the message pass through a character-translating, and/or line-wrapping gateway.

Python's quopri module can be used to decode emails with this encoding.
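As a rough illustration of that suggestion, here is a minimal sketch. It assumes the raw field decodes to an ordinary MIME message whose text parts declare Content-Transfer-Encoding: quoted-printable. The sample message and the extract_bodies helper are made up for the example and are not part of the Gmail API. Letting the email module parse the message means get_payload(decode=True) undoes the transfer encoding for you; quopri.decodestring does the same thing on a bare fragment.

```python
import base64
import email
import quopri

# Hypothetical stand-in for message['raw'] from the Gmail API:
# a tiny quoted-printable HTML message, base64url-encoded.
sample_mime = (
    b'Content-Type: text/html; charset="utf-8"\n'
    b"Content-Transfer-Encoding: quoted-printable\n"
    b"\n"
    b'<td style=3D"font-family:Tahoma,He=\n'
    b'lvetica">caf=C3=A9</td>\n'
)
raw = base64.urlsafe_b64encode(sample_mime).decode("ascii")


def extract_bodies(raw_b64url):
    """Return the decoded text of every text/* part in a raw message."""
    mime_bytes = base64.urlsafe_b64decode(raw_b64url.encode("ascii"))
    msg = email.message_from_bytes(mime_bytes)
    bodies = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        # decode=True reverses quoted-printable (or base64) per the part's header
        payload = part.get_payload(decode=True)
        charset = part.get_content_charset() or "utf-8"
        bodies.append(payload.decode(charset, errors="replace"))
    return bodies


bodies = extract_bodies(raw)
print(bodies[0])

# quopri can also decode an isolated fragment directly:
# "=3D" becomes "=", and "=" at the end of a line is a soft line break.
direct = quopri.decodestring(b'style=3D"a=\nb"')
print(direct)
```

With a real message, pass message['raw'] to the helper instead of the sample value.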

0
votes

Sadly I wasn't able to figure out the proper way to decode the message.

I ended up using the following workaround, which:

1) Splits the message into a list, with each line as a separate list item

2) Finds the list index of the line containing a known start string, and the index of the line containing the end string

3) Builds a new list from the lines between those two indexes, then rebuilds that list with the last character (the equals sign) cut off each line

4) Joins the new list into a single string

5) Searches that string for the URL I want

    x = mime_msg.splitlines()  # convert to a list of lines
    a = [i for i, s in enumerate(x) if 'My unique start string' in s][0]  # index of the first line we want
    b = [i for i, s in enumerate(x) if 'my end id' in s][0]  # index of the end line (excluded)
    y = x[a:b]  # the lines with the info we want
    new_list = []
    for item in y:
        # drop the last character: the "=" that the encoding puts at the end of each wrapped line
        new_list.append(item[:-1])
    url = "".join(new_list)  # convert back to a single string
    # crude cleanup: the encoding leaves stray "3D" and "&amp" in the text
    url = url.replace("3D", "").replace("&amp", "")
    csv_url = re.search('Whatever message comes before the URL (.*)', url).group(1)

The above uses

from __future__ import print_function  # must come before any other import
import re
from googleapiclient.discovery import build
from httplib2 import Http
from oauth2client import file, client, tools

from apiclient import errors
import base64
import email
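
For comparison, the same goal (pulling a link out of the body) can probably be reached without trimming characters by hand. This is a minimal sketch, assuming the body is quoted-printable HTML as the other answer suggests. The LinkCollector class and the example.com URL are invented for illustration. quopri removes the =3D sequences and the trailing = line breaks, and the standard HTML parser turns &amp; back into & when it reads attribute values.

```python
import quopri
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # attribute values arrive with entities such as &amp; already unescaped
                    self.links.append(value)


# Hypothetical quoted-printable fragment, wrapped mid-word like the output in the question
encoded = b'<a href=3D"https://example.com/report?id=3D42&amp;for=\nmat=3Dcsv">Download</a>'

decoded = quopri.decodestring(encoded).decode("utf-8")
collector = LinkCollector()
collector.feed(decoded)
collector.close()
print(collector.links)
```

The &amp; in the source is ordinary HTML escaping of & inside an attribute, so it is expected to be present in the raw HTML even though the rendered link does not show it.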
0
votes

I sent a mail from my ASP.NET web service to Gmail. The content is real HTML.
It displayed as intended despite the =3D.

Dim Bericht As MailMessage
Bericht = New MailMessage

the content of my styleText is

<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-=1">
<meta content="text/html; charset=us-ascii">
<style>h1{color:blue;}
.EditText{
background:#ff0000;/*rood*/
height:100;
font-size:10px;
color:#0000ff;/*blauw*/
}
</head>

and the content of my body is

<div class='EditText'>this is just some text</div>

Finally, I combine them in

Bericht.Body = "<html>" & styleText & "<body>" & content & "</body></html>"

If I look at the source of the received message, the 3D is still there. It shows:

<html><head>
<meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3Diso-8859-=
=3D1">
<meta content=3D"text/html; charset=3Dus-ascii">
<style>h1{color:blue;}
.EditText{
background:#ff0000;/*rood*/
height:100;
font-size:10px;
color:#0000ff;/*blauw*/
}
</style>
</head><body><div class=3D'EditText'>MailadresAfzender</div></body></html>

The result displayed blue text on a red background, as intended.