I've spent an entire week trying to build an ANTLR grammar that lets me parse an email message.
My goal is not to tokenize the entire email exhaustively, but to split it into its relevant sections.
Here is the document format that I have to deal with. Lines starting with "//"
are inline comments that are not part of the message:
Subject : [SUBJECT_MARKER] + lorem ipsum...
// SUBJECT_MARKER marks a message that needs to be parsed.
// SUBJECT_MARKER can be something like "help needed", "action required"
Body:
// irrelevant text we can ignore, discard or skip
Hi George,
Hope you had a good weekend. Another fluff sentence ...
// end of irrelevant text
// beginning of the SECTION_TYPE_1. SECTION_TYPE_1 marker is "answers below:"
[SECTION_TYPE_1]
Meaningful text block that needs capturing, made of many sentences: Lorem ipsum ...
[SENTENCE_MARKER] - Sentence that needs to be captured.
[SENTENCE_MARKER] - Sentence that needs to be captured.
[SENTENCE_MARKER] - Sentence that needs to be captured.
[SECTION_END_MARKER] // this is "\n\n"
// SENTENCE_MARKER can be "a)", "b)" or anything that is in the form "[a-zA-Z]')'"
// one important requirement is that SENTENCE_MARKER matches only inside a section, either SECTION_TYPE_1 or SECTION_TYPE_2
// alternatively instead of [SECTION_TYPE_1] we can have [SECTION_TYPE_2].
// if we have SECTION_TYPE_1 then try to parse SECTION_TYPE_1, else try to parse SECTION_TYPE_2.
[SECTION_TYPE_2] // beginning of SECTION_TYPE_2
Meaningful text block that needs capturing, made of many sentences: Lorem ipsum ...
[SENTENCE_MARKER] - Sentence that needs to be captured.
[SENTENCE_MARKER] - Sentence that needs to be captured.
[SENTENCE_MARKER] - Sentence that needs to be captured.
[SECTION_END_MARKER] // same as above
The problems I'm facing are the following:
- I haven't figured out a good way to skip the text at the beginning of the message and start applying the parsing rules only after a section marker (such as SECTION_TYPE_1) has been found.
- I don't know how to capture all the text inside a section between the section start and the sentence markers.
- After a SECTION_END_MARKER, all the text that comes afterwards should be ignored, and I don't know how to express that.
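
To make the expected result concrete, here is a minimal Python sketch of the behaviour I'm after. It uses plain string handling rather than ANTLR, so it is only a reference for the desired output, not a grammar. The function name parse_body, the text of the SECTION_TYPE_2 marker, and the treatment of the " - " after a sentence marker as an optional separator are assumptions made for illustration.

```python
import re

SECTION_TYPE_1 = "answers below:"   # marker given in the format description
SECTION_TYPE_2 = "details below:"   # hypothetical: the real marker text is not specified
# A sentence marker is one letter plus ")" at the start of a line.
# Assumption: the " - " that follows it is an optional separator.
SENTENCE_MARKER = re.compile(r"^[a-zA-Z]\)[ \t]*-?[ \t]*", re.MULTILINE)


def parse_body(body, markers=(SECTION_TYPE_1, SECTION_TYPE_2)):
    """Return (section_type, intro_text, sentences), or None if no section marker is found."""
    for section_type, marker in enumerate(markers, start=1):
        start = body.find(marker)
        if start == -1:
            continue  # this section type is absent, try the next one
        # Skip everything up to and including the section marker.
        rest = body[start + len(marker):].lstrip()
        # Ignore everything after the first blank line (SECTION_END_MARKER).
        section = rest.split("\n\n", 1)[0]
        # Sentence markers are only searched for inside the section.
        parts = SENTENCE_MARKER.split(section)
        intro = parts[0].strip()                       # text before the first marker
        sentences = [part.strip() for part in parts[1:]]
        return section_type, intro, sentences
    return None


sample = (
    "Hi George,\n"
    "a) this line is outside any section and must not be captured\n"
    "\n"
    "answers below:\n"
    "Meaningful text block.\n"
    "a) - first sentence\n"
    "b) - second sentence\n"
    "\n"
    "Regards,\n"
    "Bob\n"
)
result = parse_body(sample)
```

What I'm looking for is the ANTLR equivalent of these three steps: skip until the section marker, capture the section content, and discard everything after the blank line.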