2
votes

I can't get even simple semantic predicates to work with Antlr 4.6.6 for .NET Framework 4.8. The grammar below fails with "no viable alternative" for the input "received:last week".

grammar test;

// Parser rules
parse
: expr (expr)* EOF
;

expr 
: {false}? received ':' lastWeek
| received ':' text
| text
;   

received: RECEIVED;
lastWeek: LASTWEEK;
text: TEXT;

RECEIVED: 'received';

TEXT
: 
~(' ' | ':')+
;

LASTWEEK: 'last week';

SPACES: [ \t\r\n] -> skip;

UPDATE: This is a simplification of my problem. Is it possible to have a grammar that parses "received:last week" as "received" "last week" only when "last week" is preceded by "received", but parses, for example, "subject:last week" as "subject" "last" "week"?

1 Answer

3
votes

When I run this code:

public static void main(String[] args) {
    String source = "received:last week";
    testLexer lexer = new testLexer(CharStreams.fromString(source));
    testParser parser = new testParser(new CommonTokenStream(lexer));
    System.out.println(parser.parse().toStringTree(parser));
}

the error line 1:0 no viable alternative at input 'received' is printed to STDERR. When I change {false}? to {true}?, the input is parsed correctly (as expected).

If you expected the input to be parsed as received ':' text because of the {false}? predicate, you're misunderstanding how ANTLR's lexer works. The lexer produces tokens independently of the parser. It doesn't matter that the parser is trying to match a TEXT token: your input is always tokenised in the same way.

The lexer works like this:

  1. try to consume as many characters as possible
  2. if there are two or more lexer rules that match the same characters, let the one defined first "win"

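Given these rules, it is clear that "received:last week" is tokenised as RECEIVED, ':' and a LASTWEEK token. For "received", both RECEIVED and TEXT match the same characters, so RECEIVED wins because it is defined first. For "last week", LASTWEEK matches nine characters while TEXT matches only the four of "last", so LASTWEEK wins. With {false}? disabling the first alternative, no remaining alternative accepts a LASTWEEK token, hence the error.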
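To make the two rules concrete, here is a minimal sketch that imitates them with java.util.regex. It is not ANTLR's implementation: the class MaximalMunchDemo and its tokenize method are made up for illustration, the regular expressions are hand translations of the lexer rules in the question, and the "':'" entry stands in for the implicit token ANTLR creates for the ':' literal used in the parser rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the generated lexer, for illustration only.
class MaximalMunchDemo {

    // Rule names and patterns in definition order, mirroring the question's grammar.
    // "':'" represents the implicit token for the ':' literal in the parser rules.
    static final String[] NAMES = {"':'", "RECEIVED", "TEXT", "LASTWEEK", "SPACES"};
    static final Pattern[] RULES = {
        Pattern.compile(":"),
        Pattern.compile("received"),
        Pattern.compile("[^ :]+"),
        Pattern.compile("last week"),
        Pattern.compile("[ \\t\\r\\n]")
    };

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int best = -1;
            int bestLen = 0;
            for (int i = 0; i < RULES.length; i++) {
                Matcher m = RULES[i].matcher(input).region(pos, input.length());
                // Rule 1: a strictly longer match wins.
                // Rule 2: on a tie, the earlier rule is kept because later ones are not longer.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    best = i;
                    bestLen = m.end() - pos;
                }
            }
            if (best < 0) {
                throw new IllegalArgumentException("no rule matches at index " + pos);
            }
            if (!NAMES[best].equals("SPACES")) { // imitates "-> skip"
                tokens.add(NAMES[best]);
            }
            pos += bestLen;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("received:last week")); // [RECEIVED, ':', LASTWEEK]
        System.out.println(tokenize("subject:last week"));  // [TEXT, ':', LASTWEEK]
    }
}
```

Note that nothing in tokenize knows what the parser wants next, which is why a predicate in a parser rule cannot change the token stream. In this sketch "subject:last week" also ends in LASTWEEK, for the same reason.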

EDIT

Is it possible to have a grammar that can parse this "received:last week" as "received" "last week" only if the "last week" is preceded by "received" but if for example I have "subject:last week" to be parsed as "subject" "last" "week"

You could make the lexer somewhat context sensitive by using lexical modes. Modes are only allowed in lexer grammars, so you must then create separate lexer and parser grammars, which might look like this:

TestLexer.g4

lexer grammar TestLexer;

RECEIVED : 'received' -> pushMode(RECEIVED_MODE);
SUBJECT  : 'subject';
TEXT     : ~[ :]+;
COLON    : ':';
SPACES   : SPACE+     -> skip;

fragment SPACE : [ \t\r\n];

mode RECEIVED_MODE;
  LASTWEEK            : 'last' SPACE+ 'week' -> popMode;
  RECEIVED_MODE_COLON : ':'                  -> type(COLON);
  RECEIVED_MODE_TEXT  : ~[ :]+               -> type(TEXT), popMode;

You can use the lexer above like this in your parser grammar:

TestParser.g4

parser grammar TestParser;

options {
  tokenVocab=TestLexer;
}

...

Now "received:last week" will be tokenised as:

'received'                `received`
COLON                     `:`
LASTWEEK                  `last week`
EOF                       `<EOF>`

and "subject:last week" will be tokenised as:

'subject'                 `subject`
COLON                     `:`
TEXT                      `last`
TEXT                      `week`
EOF                       `<EOF>`
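If it helps to see why the two inputs come out differently, the mode switching can be imitated by hand. The sketch below is not the ANTLR runtime and does not use the generated TestLexer: ModeLexerDemo, Rule and Action are hypothetical names, the regular expressions are hand translations of the rules in TestLexer.g4, and EOF is not emitted. It keeps a stack of rule lists and only ever consults the list on top.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hand-written imitation of lexical modes; NOT the ANTLR runtime.
class ModeLexerDemo {

    enum Action { NONE, SKIP, PUSH_RECEIVED_MODE, POP_MODE }

    static final class Rule {
        final String type;      // token type reported to the caller
        final Pattern pattern;  // regex translation of the lexer rule
        final Action action;    // imitates the "-> ..." lexer command

        Rule(String type, String regex, Action action) {
            this.type = type;
            this.pattern = Pattern.compile(regex);
            this.action = action;
        }
    }

    static final String SPACE = "[ \\t\\r\\n]";

    static final List<Rule> DEFAULT_MODE = List.of(
        new Rule("RECEIVED", "received", Action.PUSH_RECEIVED_MODE),
        new Rule("SUBJECT", "subject", Action.NONE),
        new Rule("TEXT", "[^ :]+", Action.NONE),
        new Rule("COLON", ":", Action.NONE),
        new Rule("SPACES", SPACE + "+", Action.SKIP));

    static final List<Rule> RECEIVED_MODE = List.of(
        new Rule("LASTWEEK", "last" + SPACE + "+week", Action.POP_MODE),
        new Rule("COLON", ":", Action.NONE),      // -> type(COLON)
        new Rule("TEXT", "[^ :]+", Action.POP_MODE)); // -> type(TEXT), popMode

    static List<String> tokenize(String input) {
        Deque<List<Rule>> modes = new ArrayDeque<>();
        modes.push(DEFAULT_MODE);
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            Rule best = null;
            int bestEnd = pos;
            // Only the rules of the current mode (top of the stack) are candidates.
            for (Rule rule : modes.peek()) {
                Matcher m = rule.pattern.matcher(input).region(pos, input.length());
                if (m.lookingAt() && m.end() > bestEnd) { // longest match, first rule on ties
                    best = rule;
                    bestEnd = m.end();
                }
            }
            if (best == null) {
                throw new IllegalArgumentException("no rule matches at index " + pos);
            }
            if (best.action != Action.SKIP) {
                tokens.add(best.type + " `" + input.substring(pos, bestEnd) + "`");
            }
            if (best.action == Action.PUSH_RECEIVED_MODE) {
                modes.push(RECEIVED_MODE);
            } else if (best.action == Action.POP_MODE) {
                modes.pop();
            }
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        tokenize("received:last week").forEach(System.out::println);
        tokenize("subject:last week").forEach(System.out::println);
    }
}
```

In this imitation, as in the grammar, RECEIVED_MODE has no rule for whitespace, so an input such as "received: last week" (with a space after the colon) is not matched by any rule in that mode. Whether that matters depends on the inputs you need to accept.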

EDIT II

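You could also move the creation of "last week" into the parser. The lexer then produces separate LAST and WEEK tokens, and the text rule must accept them as well so they can still appear as ordinary text: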

received
 : RECEIVED ':' last_week
 ;

subject
 : SUBJECT ':' text
 ;

last_week
 : LAST WEEK
 ;

text
 : TEXT
 | LAST
 | WEEK
 ;

RECEIVED : 'received';
SUBJECT  : 'subject';
LAST     : 'last';
WEEK     : 'week';
TEXT     : ~[ :]+;