When I run this code:
public static void main(String[] args) {
    String source = "received:last week";
    testLexer lexer = new testLexer(CharStreams.fromString(source));
    testParser parser = new testParser(new CommonTokenStream(lexer));
    System.out.println(parser.parse().toStringTree(parser));
}
the error line 1:0 no viable alternative at input 'received' is printed to STDERR. When I change {false}? to {true}?, the input is parsed correctly (as expected).
If you expected the input to be parsed as received ':' text because of the {false}? predicate, you're misunderstanding how ANTLR's lexer works. The lexer produces tokens independently of the parser. It doesn't matter that the parser is trying to match a TEXT token: your input is always tokenised in the same way.
The lexer works like this:
- try to consume as many characters as possible
- if there are two or more lexer rules that match the same characters, let the one defined first "win"
Given these rules, it is clear that "received:last week" is tokenised as RECEIVED, ':' and a LASTWEEK token.
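The two rules above can be sketched in plain Java. This is only an illustration, not how ANTLR is implemented: regular expressions stand in for the lexer rules, and the rule set (including the class name MaximalMunchDemo and the COLON and SPACES names) is an assumption based on the tokens mentioned in this answer, since the original grammar is not shown here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MaximalMunchDemo {

    // Hypothetical rules in definition order (assumed, not the original grammar).
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("RECEIVED", Pattern.compile("received"));
        RULES.put("LASTWEEK", Pattern.compile("last week"));
        RULES.put("COLON", Pattern.compile(":"));
        RULES.put("TEXT", Pattern.compile("[^ :]+"));
        RULES.put("SPACES", Pattern.compile("[ \\t\\r\\n]+")); // skipped below
    }

    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestType = null;
            int bestLength = 0;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input).region(pos, input.length());
                // Only a strictly longer match replaces the current best,
                // so on a tie the rule defined first wins.
                if (m.lookingAt() && m.end() - pos > bestLength) {
                    bestType = rule.getKey();
                    bestLength = m.end() - pos;
                }
            }
            if (bestType == null) {
                throw new IllegalArgumentException("no rule matches at index " + pos);
            }
            if (!bestType.equals("SPACES")) {
                tokens.add(bestType + "(" + input.substring(pos, pos + bestLength) + ")");
            }
            pos += bestLength;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints: [RECEIVED(received), COLON(:), LASTWEEK(last week)]
        System.out.println(tokenize("received:last week"));
    }
}
```

Note that nothing in tokenize knows what a parser would like to see next: "received" ties between RECEIVED and TEXT and goes to the rule defined first, while "last week" goes to LASTWEEK because it is the longer match.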
EDIT
Is it possible to have a grammar that parses "received:last week" as "received" "last week" only if "last week" is preceded by "received", while, for example, "subject:last week" is parsed as "subject" "last" "week"?
You could make the lexer somewhat context-sensitive by using lexical modes. You must then create separate lexer and parser grammars, which might look like this:
TestLexer.g4
lexer grammar TestLexer;
RECEIVED : 'received' -> pushMode(RECEIVED_MODE);
SUBJECT : 'subject';
TEXT : ~[ :]+;
COLON : ':';
SPACES : SPACE+ -> skip;
fragment SPACE : [ \t\r\n];
mode RECEIVED_MODE;
LASTWEEK : 'last' SPACE+ 'week' -> popMode;
RECEIVED_MODE_COLON : ':' -> type(COLON);
RECEIVED_MODE_TEXT : ~[ :]+ -> type(TEXT), popMode;
You can use the lexer above like this in your parser grammar:
TestParser.g4
parser grammar TestParser;
options {
tokenVocab=TestLexer;
}
...
Now "received:last week" will be tokenised as:
'received' `received`
COLON `:`
LASTWEEK `last week`
EOF `<EOF>`
and "subject:last week" will be tokenised as:
'subject' `subject`
COLON `:`
TEXT `last`
TEXT `week`
EOF `<EOF>`
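To see why the two inputs come out differently, here is a rough plain-Java sketch of the mode stack. It is not the lexer ANTLR generates: the names ModeLexerSketch, Rule and Action are hypothetical, regular expressions stand in for the lexer rules of TestLexer.g4, and the type(...) command is modelled by simply giving the mode's rules the type they should report.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ModeLexerSketch {

    private enum Action { NONE, PUSH_RECEIVED_MODE, POP_MODE, SKIP }

    /** One lexer rule: the token type it reports, its pattern, and its command. */
    private static final class Rule {
        final String type;
        final Pattern pattern;
        final Action action;

        Rule(String type, String regex, Action action) {
            this.type = type;
            this.pattern = Pattern.compile(regex);
            this.action = action;
        }
    }

    private static final String SPACE = "[ \\t\\r\\n]";

    // Mirrors the default mode of TestLexer.g4, in definition order.
    private static final List<Rule> DEFAULT_MODE = Arrays.asList(
            new Rule("RECEIVED", "received", Action.PUSH_RECEIVED_MODE),
            new Rule("SUBJECT", "subject", Action.NONE),
            new Rule("TEXT", "[^ :]+", Action.NONE),
            new Rule("COLON", ":", Action.NONE),
            new Rule("SPACES", SPACE + "+", Action.SKIP));

    // Mirrors RECEIVED_MODE; types are already mapped to COLON and TEXT.
    private static final List<Rule> RECEIVED_MODE = Arrays.asList(
            new Rule("LASTWEEK", "last" + SPACE + "+week", Action.POP_MODE),
            new Rule("COLON", ":", Action.NONE),
            new Rule("TEXT", "[^ :]+", Action.POP_MODE));

    public static List<String> tokenize(String input) {
        Deque<List<Rule>> modes = new ArrayDeque<>();
        modes.push(DEFAULT_MODE);
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            Rule best = null;
            int bestLength = 0;
            // Only the rules of the mode on top of the stack are candidates.
            for (Rule rule : modes.peek()) {
                Matcher m = rule.pattern.matcher(input).region(pos, input.length());
                if (m.lookingAt() && m.end() - pos > bestLength) {
                    best = rule;
                    bestLength = m.end() - pos;
                }
            }
            if (best == null) {
                throw new IllegalArgumentException("no rule matches at index " + pos);
            }
            if (best.action != Action.SKIP) {
                tokens.add(best.type + "(" + input.substring(pos, pos + bestLength) + ")");
            }
            if (best.action == Action.PUSH_RECEIVED_MODE) {
                modes.push(RECEIVED_MODE);
            } else if (best.action == Action.POP_MODE) {
                modes.pop();
            }
            pos += bestLength;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("received:last week"));
        System.out.println(tokenize("subject:last week"));
    }
}
```

In this sketch LASTWEEK only exists as a candidate while RECEIVED_MODE is on top of the stack, which is the whole trick: after "subject" the lexer is still in the default mode, so "last" and "week" can only become TEXT tokens.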
EDIT II
You could also move the creation of last week into the parser like this:
received
    : RECEIVED ':' last_week
    ;
subject
    : SUBJECT ':' text
    ;
last_week
    : LAST WEEK
    ;
text
    : TEXT
    | LAST
    | WEEK
    ;
RECEIVED : 'received';
SUBJECT : 'subject';
LAST : 'last';
WEEK : 'week';
TEXT : ~[ :]+;