4
votes

The following extremely simple example grammar doesn't lex as I expected (at all).

Declaration :   'VAR';
Letter: ('A'..'Z');

message :   Declaration Letter+;

what I expected as a result is that any sequence of letters would lex as individual letters, and the sequence 'VAR' would be lexed as a single token.

When I look at the ANTLRWorks interperter I see the following results:

  • VARA parses into message -> "VAR", "A" (expected)
  • VARVA doesn't parse (MismatchedTokenException(-1 != 5). The lexer hits the second VA and tries to tokenise Declaration. Expected: message -> "VAR", "V", "A"
  • VARVPP parses into message -> "VAR", "V", "P", "P" (expected)
  • VARVALL parses into message -> "VAR", "VALL".

I would like some help understanding this behaviour, and a suggestion how I can fix this.

Specifically:

  • Why does the lexer try to tokenise all strings starting with VA into Declaration if it is followed by one letter?
  • Why doesn't the lexer try to do this with all strings starting with a V?
  • Why doesn't the lexer try to do this if there is an additional character there?
  • How should I change this grammar to parse the way I expected?
2

2 Answers

5
votes

Let's go through all of your 4 examples:

1 "VARA"

enter image description here

All okay.

2 "VARVA"

"VAR" is (obviously) tokenized as VAR, but then the lexer "sees" "VA" and expects an "R", which is not there. It emits the following errors:

line 1:5 mismatched character '<EOF>' expecting 'R'
line 1:5 required (...)+ loop did not match anything at input '<EOF>'

and discards the "VA" resulting in a single token to be created, as you can see when running ANTLRWorks' debugger (ignore the exceptions in the parse, they're not actually there :)):

enter image description here

The thing you must realize is that the lexer will never give up on something it has already matched. So if the lexer sees "VA" and cannot match an "R" after it, it will then look at the other lexer rules that can match "VA". But Letter does not match that (it only matches single letters!) If you change Letter to match more than a single character, ANTLR would be able to fall back on that rule. But not when it matches a single letter: the lexer will not give up the "A" from "VA" to let the Letter rule match. No way around it: this is how ANTLR's lexer works.

This is usually not an issue because there is often some sort of IDENTIFIER rule that the lexer can fall back on when a keyword cannot be matched.

3 "VARVPP"

enter image description here

All okay: "VAR" becomes a VAR and then the lexer tries to match an "A" after the "V" but this does not happen, so the lexer falls back on the Letter rule for the single "V". After that "PP" are both tokenized as Letters.

4 "VARVALL"

"VAR" again becomes a VAR. Then the "L" in "VAL" causes the lexer to produce the following error message:

line 1:5 mismatched character 'L' expecting 'R'

and then the last "L" becomes a Letter:

enter image description here


I guess (or hope) the first 3 question are now answered, which leaves your final answer:

How should I change this grammar to parse the way I expected?

By forcing the lexer to first look ahead in the character stream if there really is "VAR" ahead, and if there's not, just match a single "V" and change the type of the matched token to Letter, like this:

Declaration
 : ('VAR')=> 'VAR'
 |           'V'   {$type=Letter;}
 ;

As mentioned before my answer, see this related Q&A: ANTLR lexer can't lookahead at all

2
votes

The lexer doesn't really perform lookahead, only the parser does; you can read more about it in ANTLR lexer can't lookahead at all. So the problem here is that, once the lexer fails to match VAR, it tries to match what it got so far - VA - and there's no matching token, since Letter cannot match two characters, just one.

As for a solution, a trivial one is to change it to be a single token:

Message :   'VAR' ('A'..'Z')+;
message :   Message;

It won't net you a different token for each letter, though.