Why does the order of ANTLR4 tokens matter?

Question

I have a simple grammar that will eventually parse YANG source. When I make what seems to be an arbitrary change to the location of the MODULE token, the IntelliJ ANTLR4 plugin either can or cannot parse my input.

The input string to be parsed:

module x { }

Here is the grammar that works without any error:

grammar Yang ;

yang: module_open module_close;

module_open : MODULE ID BRACKET_OPEN ;

module_close: BRACKET_CLOSE ;

MODULE: 'module' ;

ID: ([A-Za-z][A-Za-z0-9_-]*) ;
BRACKET_OPEN: '{' ;
BRACKET_CLOSE: '}' ;

WS: [ \t\r\n]+ -> skip ;

Here is the grammar that fails:

grammar Yang ;

yang: module_open module_close;

module_open : MODULE ID BRACKET_OPEN ;

module_close: BRACKET_CLOSE ;

ID: ([A-Za-z][A-Za-z0-9_-]*) ;

MODULE: 'module' ;

BRACKET_OPEN: '{' ;
BRACKET_CLOSE: '}' ;

WS: [ \t\r\n]+ -> skip ;

All I'm doing is cutting and pasting the MODULE token definition before or after the ID token definition, and it always fails if the MODULE definition comes after the ID definition.

What am I missing? I see no discussion of order of tokens in the docs!

EDIT: @BartKiers Related Post... ANTLR4 lexer rules don't work as expected

@BartKiers That answer only acknowledges the problem: What are the specific rules of ordering tokens? I fail to see why what amounts to a simple substitution (a.k.a. MODULE -> "module") should be subject to the order of declaration. Please cite references in the documentation if possible. — mdeazley
@BartKiers First, thanks for the effort answering newb questions! I get the "The lexer tries to match as much characters as possible" part. But I fail to see why "and when 2 (or more) rules match the same amount of characters, the rule defined first will win." even applies in this situation... I have only one MODULE token definition and only one rule, "module_open", that uses that definition. — mdeazley
@BartKiers The closest I can come to answering my own question is "Token definitions must be declared in the order they are used in the parse tree." or something like that... But what if I use them in a different order in another parse rule? — mdeazley
"First, thanks for the effort answering ...": no problem. "... the order they are used in the parse tree.": no, that is not how it works. Since my answer is not clear to you, I'll re-open your question and remove the link. Perhaps someone else is able to explain it. — Bart Kiers

TomServo · Accepted Answer · 2017-08-23T15:09:04

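It fails if MODULE is after ID because the text 'module' is also a valid ID. The lexer tokenizes the input on its own, without knowing which token the parser expects next. It picks the rule that matches the most characters, and when two or more rules match the same number of characters, the rule defined first wins. Here both MODULE and ID match the six characters of 'module', so if the ID rule appears first, it has precedence: the lexer emits an ID token, and the parser rule module_open, which expects MODULE, fails. That's when the order of lexer rules matters: when two or more lexer rules can match the same input.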
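To make the tie-break concrete, here is a minimal Python sketch that imitates the behavior described above. It is not ANTLR's implementation, just a toy tokenizer that assumes "longest match wins, and on a tie the earlier rule wins". The function name `tokenize` and the regex translations of the lexer rules are my own, purely for illustration.

```python
import re

def tokenize(text, rules):
    """Toy lexer: rules is a list of (name, regex) in declaration order."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strictly greater: on a tie, the rule declared earlier is kept.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":  # mimic '-> skip'
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# Hypothetical regex equivalents of the lexer rules in the question.
MODULE = ("MODULE", r"module")
ID = ("ID", r"[A-Za-z][A-Za-z0-9_-]*")
COMMON = [
    ("BRACKET_OPEN", r"\{"),
    ("BRACKET_CLOSE", r"\}"),
    ("WS", r"[ \t\r\n]+"),
]

# MODULE declared before ID (the grammar that works).
working = tokenize("module x { }", [MODULE, ID] + COMMON)
# ID declared before MODULE (the grammar that fails).
failing = tokenize("module x { }", [ID, MODULE] + COMMON)

print(working)
print(failing)
```

With MODULE declared first, the first token comes out as MODULE; with ID declared first, the same six characters come out as ID, which is why module_open can no longer match. Note that an input like 'modules' is an ID in both orderings, because there the longest match decides before declaration order is ever consulted.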

Your excellent test case is a perfect illustration of this behavior at work.

There used to be a great article in the ANTLR4 documentation, by none other than Sam Harwell, that explained this perfectly, but I can no longer find it.
