
I'd like to build a natural language date parser in ANTLR4 and got stuck on ignoring "noise" input. The simplified grammar below parses any string that contains valid dates in the format DATE MONTH:

dates
    : simple_date dates
    | EOF
    ;

simple_date
    : DATE MONTH
    ;

DATE  : [0-9][0-9]?;
MONTH : 'January' | 'February' | 'March' ; // etc.

Text such as "1 January 22 February" will be accepted. I wanted the grammar to accept other text as well, so I added ANY : . -> skip; at the end:

dates
    : simple_date dates
    | EOF
    ;

simple_date
    : DATE MONTH
    ;

DATE  : [0-9][0-9]?;
MONTH : 'January' | 'February' | 'March' ; // etc.
ANY   : . -> skip;

This doesn't quite do what I want, however. A string such as "On 1 January and 22 February" is accepted and the simple_date rule is matched twice, as expected, but the string "On 1XX January" also matches the rule, because the skipped "XX" never reaches the parser.

Question: How do I build a grammar where rules are matched only with the exact token sequence while ignoring all other input, including tokens in an order not defined in any of the rules? Consider the following cases:

"From 1 January to 2 February" -> simple_date matches "1 January" and "2 February"
"From 1XX January to 2 February" -> simple_date matches "2 February", rest is ignored
"From January to February" -> no match, everything ignored
You need to post a working grammar. How does your grammar match "1 January 22 February"? Some rules should have used the + or * operators, which weren't shown. – JavaMan
Sorry, there was a typo in the grammars - I've changed date to dates in the top-level rule to make it work as described. – David

1 Answer


Do not discard "noise" in the lexer, as your ANY rule does. The lexer does not know the context in which the current token appears, and what you want is to drop noise tokens only when they are not part of a DATE MONTH sequence. Make ANY an ordinary token and match the noise in a parser rule instead.

It is also advisable to skip whitespace in the lexer. In that case, the ANY rule must exclude the characters matched by the WS rule. Note too that your DATE rule also matches noise tokens of the form [0-9][0-9]?, so the noise rule has to accept DATE as well.

dates
    : (noise* simple_date noise*)+
    ;

simple_date
    : DATE MONTH
    ;

noise
    : DATE
    | ANY
    ;

DATE  : [0-9][0-9]?;
MONTH : 'January' | 'February' | 'March' ;
ANY   : ~(' '|'\t' | '\f')+ ;
WS    : [ \t\f]+ -> skip;

Accepts:

1 January and 22 February  noise 33
1 January and 22 February 3

Rejects:

1xx January
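
Why "1xx January" is rejected can be seen by simulating the lexer by hand. The following is only a rough Python sketch, not ANTLR itself: it assumes the usual ANTLR lexer behaviour of taking the longest match and, on a tie, the rule listed first. The RULES table and the tokenize helper are hypothetical stand-ins for the four lexer rules above.

```python
import re

# Hypothetical stand-ins for the lexer rules above, kept in the same order.
RULES = [
    ("DATE", re.compile(r"[0-9][0-9]?")),
    ("MONTH", re.compile(r"January|February|March")),
    ("ANY", re.compile(r"[^ \t\f]+")),
    ("WS", re.compile(r"[ \t\f]+")),
]


def tokenize(text):
    """Return (rule, lexeme) pairs: longest match wins, first rule wins ties."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_match = None, None
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on equal length the earlier rule is kept.
            if m and (best_match is None or m.end() > best_match.end()):
                best_name, best_match = name, m
        # ANY and WS together cover every character, so a match always exists.
        if best_name != "WS":  # WS is skipped, like "-> skip"
            tokens.append((best_name, best_match.group()))
        pos = best_match.end()
    return tokens


print(tokenize("1xx January"))
print(tokenize("22 February"))
```

Under these assumptions "1xx" becomes a single ANY token (three characters beat the one-character DATE match), so the parser sees ANY MONTH and has no DATE to start a simple_date. "January" ties between MONTH and ANY, and MONTH wins only because it is listed first.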

This wasn't fully tested. Note also that your MONTH lexer rule matches a standalone month literal (e.g. January), which should be treated as noise but is not handled in my grammar, e.g.

22 February January
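
For reference, the behaviour the question asks for, including a stray month treated as noise, can be written down as a small token scan. This is a plain Python sketch of the intended result, not an ANTLR grammar; find_simple_dates and MONTHS are hypothetical names, and it splits on any whitespace, which is slightly broader than the WS rule above.

```python
import re

DATE = re.compile(r"[0-9][0-9]?")
MONTHS = {"January", "February", "March"}  # assumption: extend with the other months


def find_simple_dates(text):
    """Return (date, month) pairs for every adjacent DATE MONTH; the rest is noise."""
    words = text.split()
    matches = []
    i = 0
    while i < len(words):
        is_pair = (
            i + 1 < len(words)
            and DATE.fullmatch(words[i]) is not None
            and words[i + 1] in MONTHS
        )
        if is_pair:
            matches.append((words[i], words[i + 1]))
            i += 2  # consume both tokens of the match
        else:
            i += 1  # a lone DATE, a lone MONTH, or any other word is skipped
    return matches


print(find_simple_dates("From 1 January to 2 February"))
print(find_simple_dates("From 1XX January to 2 February"))
print(find_simple_dates("22 February January"))
```

In grammar terms this corresponds to letting the noise rule also accept MONTH, which was not tried here.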