Proper way to resolve ANTLR lexer rule ambiguities?

Question

Please see the source code available at: https://gist.github.com/1684022.

I've got two tokens defined:

ID  :   ('a'..'z' | 'A'..'Z') ('0'..'9' | 'a'..'z' | 'A'..'Z' | ' ')*;

PITCH   
    :   (('A'|'a') '#'?)
    |   (('B'|'b') '#'?) 
    |   (('C'|'c') '#'?);

Obviously, a single letter such as "A" is ambiguous: it matches both ID and PITCH.

I further define:

note    :   PITCH;
name    :   ID;
main    :   name ':' note '\n'?;

Now, if I enter "A:A" as input to the parser, I always get an error. The parser expects either PITCH or ID, depending on which of the two is defined first:

mismatched input 'A' expecting ID

What is the proper way to resolve this so that it works as intended?

As described, although it is intuitively clear how the input should be parsed, ANTLR doesn't do the "right thing". Even though the main rule says a name/ID should come first, the lexer seems to be ignorant of this and identifies "A" as a PITCH because it follows the "longest match"/"which comes first" rule rather than the more reasonable "what the rule says" rule.
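The behaviour described above can be reproduced with a small model. The following Python sketch is not ANTLR; it is a simplified stand-in that assumes the lexer picks the longest match and, on a tie, the rule listed first, with no knowledge of the parser rules. The names RULES and tokenize, the COLON entry, and the regex translations of ID and PITCH are all made up for illustration.

```python
import re

# Regex stand-ins for the question's lexer rules (an approximation, not ANTLR itself).
RULES = [
    ("ID", re.compile(r"[A-Za-z][0-9A-Za-z ]*")),
    ("PITCH", re.compile(r"[A-Ca-c]#?")),
    ("COLON", re.compile(r":")),  # stands in for the literal ':' in the main rule
]

def tokenize(text, rules):
    """Longest match wins; on a tie, the rule listed first wins."""
    pos, tokens = 0, []
    while pos < len(text):
        best = None  # (rule name, end position) of the best match so far
        for name, pattern in rules:
            m = pattern.match(text, pos)
            # Strictly longer matches replace the current best, so ties keep the earlier rule.
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return tokens

id_first = tokenize("A:A", RULES)
pitch_first = tokenize("A:A", [RULES[1], RULES[0], RULES[2]])
print(id_first)
print(pitch_first)
```

In this model both occurrences of "A" always receive the same token type, whichever rule is listed first, so a rule of the form name ':' note can never match "A:A". The tokenizer decides without looking at what the parser expects.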

Is the only solution to fake/hack it by matching both ID and PITCH, and then recombining them later as dasblinkenlight says?

Look, Bart. Whether or not I understand ANTLR, the point you keep hammering on about, is irrelevant. I am seeking a solution that makes sense and although you have provided one answer and four comments, none of them are solutions, just commentary on my post or my understanding. If you understand ANTLR and you understand my problem better than I do, then post a real solution. — Ana

Sergey Kalinichenko · Accepted Answer · 2012-01-26T19:19:19

Here is how I would re-factor this grammar to make it work:

ID  :   (('a'..'z' | 'A'..'Z') ('0'..'9' | 'a'..'z' | 'A'..'Z' | ' ')+)
    |   ('d'..'z' | 'D'..'Z');

PITCH : 'a'..'c' | 'A'..'C';

SHARP : '#';

note    :   PITCH SHARP?;

name    :   ID | PITCH;

main    :   name ':' note '\n'? EOF;

This separates long names from one-character pitch names, which get "reunited" in the parser through the name rule. The "sharp" also becomes a token of its own, which the parser recognizes as optional.
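To see why the refactoring works, here is a rough Python model of it. It is a sketch under assumptions, not ANTLR output: the regexes approximate the lexer rules above, the tokenizer assumes "longest match, then first rule listed", and the helper names tokenize and parses_main are hypothetical. The optional '\n' and the EOF check are left out for brevity.

```python
import re

# Regex approximations of the refactored lexer rules (hypothetical stand-ins).
RULES = [
    ("ID", re.compile(r"[A-Za-z][0-9A-Za-z ]+|[D-Zd-z]")),
    ("PITCH", re.compile(r"[A-Ca-c]")),
    ("SHARP", re.compile(r"#")),
    ("COLON", re.compile(r":")),  # stands in for the literal ':' in the main rule
]

def tokenize(text):
    """Longest match wins; on a tie, the rule listed first wins."""
    pos, tokens = 0, []
    while pos < len(text):
        best = None  # (rule name, end position) of the best match so far
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return tokens

def parses_main(text):
    """Model of: main : name ':' note, name : ID | PITCH, note : PITCH SHARP?."""
    kinds = [kind for kind, _ in tokenize(text)]
    if len(kinds) < 3 or kinds[0] not in ("ID", "PITCH") or kinds[1] != "COLON":
        return False
    return kinds[2:] in (["PITCH"], ["PITCH", "SHARP"])

print(tokenize("A:A"))
print(parses_main("A:A"), parses_main("Bob:C#"), parses_main("A:D"))
```

A lone "A" is now always a PITCH at the lexer level, and the name rule accepts it wherever a name is expected, so the decision that the lexer could not make is moved into the parser.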
