
I've spent an entire week trying to build an ANTLR grammar that parses an email message.

My goal is not to tokenize the entire email exhaustively, but to split it into the relevant sections.

Here is the document format that I have to deal with. Lines starting with // are inline comments that are not part of the message:

Subject : [SUBJECT_MARKER] + lorem ipsum...
// marks a message that needs to be parsed.
// Subject marker can be something like "help needed", "action required"

Body: 

// irrelevant text we can ignore, discard or skip
Hi George,
Hope you had a good weekend. Another fluff sentence ...
// end of irrelevant text


// beginning of the SECTION_TYPE_1. SECTION_TYPE_1 marker is "answers below:" 
[SECTION_TYPE_1]

Meaningful text block that needs capturing, made of many sentences: Lorem ipsum ...

[SENTENCE_MARKER] - Sentence that needs to be captured.
[SENTENCE_MARKER] - Sentence that needs to be captured.
[SENTENCE_MARKER] - Sentence that needs to be captured.

[SECTION_END_MARKER] // this is "\n\n"

// SENTENCE_MARKER can be "a)", "b)" or anything that is in the form "[a-zA-Z]')'"
// one important requirement is that this SENTENCE_MARKER matches only inside a section. Either SECTION_TYPE_1 or SECTION_TYPE_2


// alternatively instead of [SECTION_TYPE_1] we can have [SECTION_TYPE_2].
// if SECTION_TYPE_1 is present, parse it; otherwise try to parse SECTION_TYPE_2.

[SECTION_TYPE_2] // beginning of section type 2

Meaningful text block that needs capturing, made of many sentences: Lorem ipsum ...

[SENTENCE_MARKER] - Sentence that needs to be captured.
[SENTENCE_MARKER] - Sentence that needs to be captured.
[SENTENCE_MARKER] - Sentence that needs to be captured.

[SECTION_END_MARKER] // same as above

The problems I'm facing are the following:

  • I haven't found a good way to skip the text at the beginning of the message and start applying the parsing rules only after a section marker (SECTION_TYPE_1 or SECTION_TYPE_2) has been found.
  • I need to capture all the text inside a section, both the block between the section start and the first sentence marker and the marked sentences themselves.
  • After a SECTION_END_MARKER, all the text that follows should be ignored.

1 Answer


ANTLR is a parser generator for structured, ideally unambiguously structured, text. Unless your source messages have reasonably well-defined features that reliably mark the parts of interest, ANTLR is unlikely to work.

A better approach would be to use a natural language processing (NLP) package to identify the form and object of each sentence or phrase, and from that pick out the ones of interest. The Stanford NLP package is quite well known (GitHub).

Update

The necessary grammar will be of the form:

message : subject ( sec1 | sec2 | fluff )* EOF ;

subject : fluff* SUBJECT_MARKER subText EOL ;
subText : ( word | HWS )+ ;

sec1    : ( SECTION_TYPE_1 content )+ SECTION_END_MARKER     ;
sec2    : ( SECTION_TYPE_2 content )+ SECTION_END_MARKER     ;
content : ( SENTENCE_MARKER | word | ws )+ ;

word    : CHAR+ ;
ws      : ( EOL | HWS )+ ;

fluff   : . ;

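SUBJECT_MARKER      : 'marker' ;
SECTION_TYPE_1      : 'text1' ;
SECTION_TYPE_2      : 'text2' ;
SENTENCE_MARKER     : [a-zA-Z0-9] ')' ;

// blank line ("\n\n" in the question); the longer match wins over a single EOL.
// Assumes blank lines occur only at the end of a section.
SECTION_END_MARKER  : EOL EOL ;
EOL                 : '\r'? '\n';
HWS                 : [ \t] ;
CHAR                : . ;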
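
Before committing to a grammar, it can help to prototype the marker rules with a small line scanner that has the same shape: skip fluff until a section marker, collect the section text, collect the marked sentences, and stop at the section end. The following is a rough Python sketch, not ANTLR output. Only "answers below:" comes from the question; "details below:" is a hypothetical second marker, the function name parse_body is made up, and the rule that a section ends at the first blank line after the sentence list has started is an assumption.

```python
import re

# Assumed marker strings: "answers below:" is from the question,
# "details below:" is a hypothetical stand-in for SECTION_TYPE_2.
SECTION_MARKERS = {"answers below:": "type1", "details below:": "type2"}

# A sentence marker such as "a)" at the start of a line, optionally followed by " - ".
SENTENCE_MARKER = re.compile(r"([a-zA-Z0-9])\)\s*(?:-\s*)?(.*)")


def parse_body(body):
    """Return (section_type, intro_text, sentences), or None if no section marker is found."""
    section_type = None
    intro, sentences = [], []
    for line in body.splitlines():
        stripped = line.strip()
        if section_type is None:
            # Fluff: ignore everything until a line that is exactly a section marker.
            section_type = SECTION_MARKERS.get(stripped.lower())
            continue
        if not stripped:
            if sentences:
                break  # assumed rule: a blank line after the sentence list ends the section
            continue  # blank lines before the sentence list are tolerated
        match = SENTENCE_MARKER.match(stripped)
        if match:
            sentences.append((match.group(1), match.group(2)))
        elif sentences:
            # Unmarked line inside the list: treat it as a continuation of the last sentence.
            marker, text = sentences[-1]
            sentences[-1] = (marker, text + " " + stripped)
        else:
            intro.append(stripped)
    if section_type is None:
        return None
    return section_type, " ".join(intro), sentences


sample = (
    "Hi George,\n"
    "Hope you had a good weekend.\n"
    "\n"
    "Answers below:\n"
    "\n"
    "Here is what we found.\n"
    "\n"
    "a) - Restart the server.\n"
    "b) - Clear the cache.\n"
    "\n"
    "Thanks,\n"
    "Ann\n"
)
result = parse_body(sample)
```

Everything after the terminating blank line ("Thanks, Ann") is never inspected, which corresponds to the trailing fluff alternative in the message rule.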

Success will depend on just how unambiguous the various markers are, and it is a given that there will be ambiguities. Either modify the grammar to handle the ambiguities explicitly, or defer their resolution to the tree-walk/analysis phase.
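
One concrete ambiguity of this kind: a pattern equivalent to SENTENCE_MARKER also matches the tail of any parenthesized word, such as the "l)" in "(optional)". The short Python sketch below shows the effect and one possible way to tighten the rule. Accepting a marker only at the start of a line is an assumption about the message format, not something stated in the question.

```python
import re

# Unanchored pattern, equivalent to the lexer rule SENTENCE_MARKER : [a-zA-Z0-9] ')' ;
LOOSE = re.compile(r"[a-zA-Z0-9]\)")

# Stricter variant (an assumption): a marker counts only at the start of a line.
ANCHORED = re.compile(r"^[ \t]*([a-zA-Z0-9])\)", re.MULTILINE)

sample = (
    "a) restart the server (optional)\n"
    "b) clear the cache\n"
)

loose_hits = LOOSE.findall(sample)        # includes the false positive from "(optional)"
anchored_hits = ANCHORED.findall(sample)  # only the markers that open a line
```

The same restriction could be expressed in the grammar itself or applied afterwards, when walking the parse tree.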