When sending an email, many servers add additional line breaks to limit the length of each line.

How can the original line breaks be recovered when fetching the email in a PHP script?

Example

Suppose I send the following content:

Lorem ipsum Dolore incididunt in culpa ea ea sed quis sint voluptate quis laborum ullamco Excepteur do adipisicing consequat ex in reprehenderit officia in ad deserunt magna nulla dolor laborum occaecat reprehenderit aliquip dolor ea anim ea in veniam adipisicing culpa tempor qui elit voluptate consectetur elit laboris minim consectetur laboris anim incididunt Ut sunt sunt mollit elit irure do cillum dolore consequat in ea culpa ut velit sunt nulla in dolore voluptate dolore laborum reprehenderit dolore ut.
Ut non in veniam enim minim elit ad ut id ad eu voluptate cillum dolor laboris irure tempor mollit dolore exercitation eiusmod ea non ea ullamco nostrud cillum nostrud laborum commodo esse reprehenderit ut deserunt officia do in anim dolore ullamco pariatur ex amet nulla Excepteur mollit officia fugiat eu sed quis nisi fugiat dolor ea commodo ut sunt in consequat consectetur ut nulla pariatur est dolor dolore non ut occaecat officia Duis Ut ex exercitation esse ullamco nulla incididunt commodo pariatur dolore nostrud fugiat id dolor minim non sint amet adipisicing occaecat enim non Ut ad irure sint aliquip nisi ut commodo minim proident elit nulla quis ut ad dolor Excepteur dolore Duis.

Note that there is just one single line break in this text!

Checking the source code of the email at the receiving end using Thunderbird, or fetching the email body via PHP, the content is formatted like this:

Lorem ipsum Dolore incididunt in culpa ea ea sed quis sint voluptate
quis laborum ullamco Excepteur do adipisicing consequat ex in
reprehenderit officia in ad deserunt magna nulla dolor laborum occaecat
reprehenderit aliquip dolor ea anim ea in veniam adipisicing culpa
tempor qui elit voluptate consectetur elit laboris minim consectetur
laboris anim incididunt Ut sunt sunt mollit elit irure do cillum dolore
consequat in ea culpa ut velit sunt nulla in dolore voluptate dolore
laborum reprehenderit dolore ut.
Ut non in veniam enim minim elit ad ut id ad eu voluptate cillum dolor
laboris irure tempor mollit dolore exercitation eiusmod ea non ea
ullamco nostrud cillum nostrud laborum commodo esse reprehenderit ut
deserunt officia do in anim dolore ullamco pariatur ex amet nulla
Excepteur mollit officia fugiat eu sed quis nisi fugiat dolor ea commodo
ut sunt in consequat consectetur ut nulla pariatur est dolor dolore non
ut occaecat officia Duis Ut ex exercitation esse ullamco nulla
incididunt commodo pariatur dolore nostrud fugiat id dolor minim non
sint amet adipisicing occaecat enim non Ut ad irure sint aliquip nisi ut
commodo minim proident elit nulla quis ut ad dolor Excepteur dolore Duis.

Note that each line is limited to a certain length, so 16 additional line breaks are present. These additional line breaks were automatically added somewhere in the chain of events leading to me receiving the email.

I want my email-fetching PHP script to remove the additional line breaks to restore the original two-line format of the content.

I know that the new line breaks are not added by the PHP script, and I know where they come from. What I do not know is how to make my PHP script remove them.

Here is the code used to fetch the email body:

$connection = imap_open(
    sprintf(
        '{%s:110/pop3}INBOX',
        Configure::read('Email.Inbox.host')
    ),
    Configure::read('Email.Inbox.email'),
    Configure::read('Email.Inbox.password')
);

$mailbox = imap_check($connection);
$messages = imap_fetch_overview($connection, '1:' . $mailbox->Nmsgs); 

foreach($messages as $message) {
    $content = imap_fetchbody($connection, $message->msgno, 1);
}

What have I tried?

I tried using imap_body instead of imap_fetchbody, as the former does not process the email body. But the additional line breaks are already present before that and are indistinguishable from the regular line breaks. Both consist of \r\n.

I assume there has to be a way to do this, as Thunderbird displays the received email with the correct formatting, without the additional 16 line breaks, although they are present in the source code of the displayed message. So there probably has to be a way to strip the additional 16 line breaks from the email.

Here is a screenshot from Thunderbird which shows the source code of the email on the top and the resulting plain-text display on the bottom.

What is this magic? Teach me, master!

you have format=flowed, which means the mail client CAN ignore the line breaks if it so chooses. – Marc B
@MarcB: That is exactly my question! How CAN I ignore the additional line breaks when fetching the email using PHP? Doing something like str_replace("\r\n", '', $content); would replace all line breaks, not only the ones automatically added. – Lars Ebert
php wouldn't add any new line breaks. it'll spit out EXACTLY what came back from the imap server. it's up to you to write the code to reformat that text into whatever you want it to be. – Marc B
@MarcB Let me illustrate my problem: Both Thunderbird and my PHP script receive the same content from the IMAP server. That content, in both cases, contains additional line breaks which were added at some point. Thunderbird apparently does some magic on the fetched content to remove the added line breaks. I want my PHP script to do just that. It shall receive the liney-breakey-content from the IMAP server and filter out all the line breaks that were not present when the email was sent. But I do not know how! My question is just "How?" – Lars Ebert
yeah, because TB has a bunch of code to handle formatting text for display. PHP doesn't. it's not a mail client, it's up to YOU to provide the code/logic to replicate whatever TB is doing. The email is just text, so whatever you end up doing is going to involve string manipulation. – Marc B

1 Answer

Even though this question is old, it was one of the top hits when I ran into this exact same problem. As Marc pointed out in the comments, it does have to do with format=flowed. So I dove into RFC 2646 (since obsoleted by RFC 3676, which keeps the rules quoted here and adds a DelSp parameter) and found section 4.1, Generating Format=Flowed:

Because a soft line break is a SP CRLF sequence, the generating agent creates one by inserting a CRLF after the occurance of a space.

A generating agent SHOULD NOT insert white space into a word (a sequence of printable characters not containing spaces). If faced with a word which exceeds 79 characters (but less than 998 characters, the [SMTP] limit on line length), the agent SHOULD send the word as is and exceed the 79-character limit on line length.

So in order to get the email as it was originally written, search for every SP+CRLF sequence and remove the CRLF, keeping the space so the words on either side do not run together. Then you might also want to undo the space-stuffing, while also accounting for quoted text (lines starting with any number of > chars followed by a space). According to the RFC, the order of tests is quotation marks > space stuffing > flowed lines:

On reception, if the first character of a line is a space, it is logically deleted. This occurs after the test for a quoted line, and before the test for a flowed line.
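
The order of tests quoted above can also be sketched as a line-by-line pass instead of regular expressions. This is only a sketch under a few assumptions: unflowText() and its $delSp flag are hypothetical names (not part of PHP or the imap extension), the body has already been transfer-decoded, and a line consisting of "-- " is treated as the signature separator and never joined with the next line, as the RFC requires.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: turn a format=flowed body back into logical lines.
 * $delSp mirrors the DelSp=yes parameter of RFC 3676 (assumption: the caller
 * has read it from the Content-Type header).
 */
function unflowText(string $text, bool $delSp = false): string
{
    $out = [];          // finished logical lines
    $buffer = null;     // paragraph currently being assembled, or null
    $bufferDepth = 0;   // quote depth of the buffered paragraph

    // Emit the buffered paragraph, re-adding its quote marks.
    $flush = function () use (&$out, &$buffer, &$bufferDepth): void {
        if ($buffer !== null) {
            $prefix = $bufferDepth > 0 ? str_repeat('>', $bufferDepth) . ' ' : '';
            $out[] = $prefix . $buffer;
            $buffer = null;
        }
    };

    foreach (preg_split('/\r\n|\r|\n/', $text) as $line) {
        // 1. Quote test: count and strip the leading '>' characters.
        $depth = strspn($line, '>');
        $line = (string) substr($line, $depth);

        // 2. Space-stuffing: logically delete exactly one leading space.
        if ($line !== '' && $line[0] === ' ') {
            $line = (string) substr($line, 1);
        }

        // 3. Flowed test: a trailing space marks a soft line break,
        //    except on the signature separator "-- ".
        $flowed = $line !== '-- ' && substr($line, -1) === ' ';
        if ($flowed && $delSp) {
            $line = (string) substr($line, 0, -1);
        }

        // A change in quote depth always ends the current paragraph.
        if ($buffer !== null && $depth !== $bufferDepth) {
            $flush();
        }

        $buffer = ($buffer ?? '') . $line;
        $bufferDepth = $depth;
        if (!$flowed) {
            $flush();
        }
    }
    $flush();

    return implode("\n", $out);
}
```

Called as unflowText($content) on the body from the question, this should give back the original two lines, since every added break follows a space and the one real break does not.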

A crude PoC from my own kitchen:

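// I'm using fetchmime() because I want to be sure I'm getting the proper MIME type for the relevant section
// $section must be the same part number that was passed to imap_fetchbody() ('1' in the question's code)
$section = '1';
$mimes = imap_fetchmime($connection, $message->msgno, $section);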

// I don't want to store all headers in an array since I just want to know the Content-Type
// [ \t]* is probably not necessary but it's there in case of broken clients/servers
if(preg_match('/^[ \t]*Content-Type.*format=flowed\b/mi', $mimes)) {
    // First, let's undo space stuffing but don't touch stuffed lines with quotes
    // (the RFC only deletes the first space of a line, so remove exactly one)
    $content = preg_replace('/^ (?!>+ )/m', '', $content);

    // Then, remove flowed SP+(CR)LF sequences as well as any possible quotation marks that might appear after it to reform one long line of text
    $content = preg_replace('/( )\r?\n(>+ +)?/', '$1', $content);

    // Remove empty quoted lines at *the end of the string only*, keeping any such lines anywhere else as-is for readability
    $content = preg_replace('/(\r?\n>+\s*)+$/', '', $content);
}
// And finally trim the entire thing (regardless of formatting)
$content = trim($content);
// Or when outputting to browsers:
//$content = nl2br(trim($content));

For me this works just fine on:

  • simple one-line emails
  • the lorem ipsum example the OP gave with 2 paragraphs
  • one-liners followed by 2 linebreaks and a signature consisting of 2 lines
  • emails with quotes up to 4 levels (and probably beyond but I didn't bother checking that far)
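
One case the regex on the raw MIME headers may miss is a Content-Type header that is folded across several lines, because "." does not match a line break. As a possible alternative, the parameters can be read from the object returned by imap_fetchstructure(), which exposes them as a list of attribute/value pairs. The sketch below assumes that object shape; flowedParams() is a hypothetical helper name, and the imap call itself is left in a comment so the snippet runs without a mailbox.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: inspect one body part as returned by
 * imap_fetchstructure() and report the format=flowed related parameters.
 */
function flowedParams(object $part): array
{
    $result = ['flowed' => false, 'delsp' => false];

    // ifparameters is 0 when the part carries no Content-Type parameters.
    if (empty($part->ifparameters)) {
        return $result;
    }

    foreach ($part->parameters as $param) {
        // Parameter names and these values are case-insensitive.
        $name = strtolower((string) $param->attribute);
        $value = strtolower((string) $param->value);

        if ($name === 'format' && $value === 'flowed') {
            $result['flowed'] = true;
        }
        if ($name === 'delsp' && $value === 'yes') {
            $result['delsp'] = true;
        }
    }

    return $result;
}

// Usage with a live connection (assumption: part 1 is the text/plain part):
// $structure = imap_fetchstructure($connection, $message->msgno);
// $part = !empty($structure->parts) ? $structure->parts[0] : $structure;
// $params = flowedParams($part);
```

If $params['flowed'] is true, the body can be unflowed as described above; $params['delsp'] says whether the space before each soft break has to go as well.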