1
votes

My users will send me posts by email ala Posterous

I'm using Google Apps Engine (GAE) to receive and parse emails. GAE returns the text part of the message.

I need to extract the post from the plain text part of the message.

The plain text can be "contaminated" with promotional headers, footers, signatures, etc.

Also I would like to leave out the "please post this:" or similar some people candidly include.

How would you achieve this?

Are there any tools (simpler than regex) I can use?

UPDATE

Examples:

(in all these examples the post is "Lorem ipsum sit amet..."

=====

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.

Victor P
[email protected]
visit my blog at: www.example.com/victor

=====

Hello, I like your page. Please can you include this: Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.

=====

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.

Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.

=====

If you find more examples of what a email can be, please feel free to include them in the post.

1
It's far from trivial, but depending on specifics, a few tricks could be applied and address a fair proportion of the targeted situations. Can you provide more details or maybe examples which would show the particular format you have at hand (for example, "plain text = no mark-up at all?".mjv
post a couple of samples on what should and shouldn't get through the regex.Amarghosh
I edited the post and added some examplesVictor

1 Answers

2
votes

I would go with a list of compiled regular expressions. Something along the lines of:

import re

regexes = (
    re.compile("visit my blog at: .*$", re.IGNORECASE),
    re.compile("please post this:", re.IGNORECASE),
    re.compile("please can you include this:", re.IGNORECASE)
    # etc
)

for filePath in files:
    with open(filePath) as file:
        for line in file:
            for regex in regexes:
                print(re.sub(regex, ""))