
I have the following grammar:

rule: 'aaa' | 'a' 'a';

It can successfully parse the string 'aaa', but it fails to parse 'aa' with the following error:

line 1:2 mismatched character '<EOF>' expecting 'a'

FYI, this is the lexer's problem, not the parser's, because I don't even call the parser. The main function looks like this:

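@members {
  // Drive the lexer directly (no parser involved) and print each token type.
  public static void main(String[] args) throws Exception {
    RecipeLexer lexer = new RecipeLexer(new ANTLRInputStream(System.in));
    // Pull tokens one at a time until the lexer reports end of input.
    for (Token t = lexer.nextToken(); t.getType() != EOF; t = lexer.nextToken())
      System.out.println(t.getType());
  }
}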

The result is the same with the more obvious version:

rule: AAA | A A;
AAA: 'aaa';
A: 'a';

Obviously the ANTLR lexer tries to match the input 'aa' against the rule AAA, which fails. Regardless of whether ANTLR is an LL(*) parser generator, the lexer should work separately from the parser, and it should be able to resolve this ambiguity on its own. The grammar works fine with good old lex (or flex), but it doesn't seem to work with ANTLR. So what is the problem here?

Thanks for the help!

How are the tokens defined in your lexer? It looks to me like the lexer prefers to match a instead of aaa when given a single a as input. - Dervall
@Dervall The token file looks like this: A=4 AAA=5. It prefers aaa to a. And it can parse aaa and a, but not aa. - K J
@AustinHenley: Yes, it is greedy in the sense that it prefers longer tokens when there are multiple choices. But with the input 'aa', 'aaa' is not even a possible choice. - K J
Check out this incredibly detailed yet easy to follow page: wincent.com/wiki/ANTLR_lexers_in_depth. It helped me a lot to understand the ANTLR Lexer quirks. Especially the ".+ and .* default to non-greedy behaviour" is quite surprising! - TFuto

1 Answer


ANTLR's generated parsers are (or can be) LL(*), not its lexers.

When the lexer sees the input "aa", it tries to match token AAA. When that fails, it only looks for another single token that also matches "aa"; it does not backtrack and match A twice. Since no such token exists, an error is produced.
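
To make the difference with lex/flex concrete, here is a toy model in plain Java. It is not ANTLR's generated code and uses no ANTLR classes; the class name LexerDemo and both method names are made up for this illustration. It assumes that two characters of lookahead are enough for the lexer to choose between AAA and A, which is consistent with the error being reported at position 1:2. longestMatch behaves like lex/flex (take the longest literal that really matches at the current position), while committed picks a rule from the lookahead and then does not fall back.

```java
import java.util.ArrayList;
import java.util.List;

public class LexerDemo {
    // Token names and literals from the question: AAA : 'aaa';  A : 'a';
    static final String[] NAMES = {"AAA", "A"};
    static final String[] LITERALS = {"aaa", "a"};

    /** lex/flex style: at each position, take the longest literal that fully matches. */
    static List<String> longestMatch(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int best = -1;
            for (int i = 0; i < LITERALS.length; i++) {
                boolean matches = input.startsWith(LITERALS[i], pos);
                if (matches && (best < 0 || LITERALS[i].length() > LITERALS[best].length())) {
                    best = i;
                }
            }
            if (best < 0) {
                throw new IllegalStateException("no token matches at " + pos);
            }
            tokens.add(NAMES[best]);
            pos += LITERALS[best].length();
        }
        return tokens;
    }

    /** Committed style (toy model): predict a rule from two characters, then never fall back. */
    static List<String> committed(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            // Assumption: seeing "aa" is taken as proof that the token must be AAA.
            boolean predictAAA = input.startsWith("aa", pos);
            String literal = predictAAA ? "aaa" : "a";
            if (!input.startsWith(literal, pos)) {
                // The chosen rule failed and there is no backtracking to A.
                throw new IllegalStateException(
                        "mismatched input at " + pos + ", expecting '" + literal + "'");
            }
            tokens.add(predictAAA ? "AAA" : "A");
            pos += literal.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(longestMatch("aa"));   // two A tokens
        System.out.println(committed("aaa"));     // one AAA token
        try {
            committed("aa");
        } catch (IllegalStateException e) {
            System.out.println("committed lexer failed: " + e.getMessage());
        }
    }
}
```

In this model, "a" and "aaa" lex the same way under both strategies, and only "aa" separates them, which matches what the question and comments report.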

This is usually not a problem, since in practice there's often some sort of identifier rule that "aa" can fall back to. So, what actual problem are you trying to solve, or were you only curious about the inner workings? If it's the former, please edit your question and describe your actual problem.