Understanding lexer rule resolution in ANTLR4

Question

I'm reading the ANTLR4 definitive guide and I'm now at the section about lexer rule resolution. Here is what this section says:

grammar KeywordTest;
enumDef : 'enum' '{' ... '}';
...
FOR: 'for';
...
ID: [a-zA-Z]+; // does not match 'enum' or 'for'

Rule ID could also match keywords such as enum or for, which means there's more than one rule that could match the same string. [...] Literals such as 'enum' become lexical rules and go immediately after the parser rules but before the explicit lexical rules.

What does it mean and how does it help us to resolve the potential ambiguities? I would say that a declaration like

ENUM_KEYWORD: 'enum';

which ANTLR4 might use internally would be declared right after the rule enumDef: 'enum' '{' ... '}' and would look as follows:

enumDef: ENUM_KEYWORD '{' ... '}';
ENUM_KEYWORD: 'enum';

Is that exactly how ANTLR4 does things?

Divisadero Divisadero · Accepted Answer · 2016-03-08T11:08:07

The order of lexer rules is very important in a grammar: the lexer takes the longest possible match, and when several rules match the same text, the rule defined first is used. You can read more here.

So if you have lexer rules:

ID: [a-zA-Z]+;
FOR: 'for';

then the input "for" is matched by both rules, and the order decides which token you get: with ID defined first, as here, "for" becomes an ID token; with FOR defined first, it becomes a FOR token.
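The effect of rule order can be sketched with a toy tokenizer. This is a minimal Python model, not ANTLR's implementation: it assumes the lexer prefers the longest match and breaks ties in favour of the rule listed first. The tokenize helper and the two rule lists are hypothetical names used only for illustration.

```python
import re


def tokenize(text, rules):
    """Toy lexer: longest match wins; ties go to the rule listed first."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on a tie the earlier rule is kept.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


# Same two rules, listed in different orders (hypothetical rule tables).
id_first = [("ID", r"[a-zA-Z]+"), ("FOR", r"for")]
for_first = [("FOR", r"for"), ("ID", r"[a-zA-Z]+")]

print(tokenize("for", id_first))      # [('ID', 'for')]
print(tokenize("for", for_first))     # [('FOR', 'for')]
print(tokenize("format", for_first))  # [('ID', 'format')]
```

In this model, order only matters when two rules match text of the same length; "format" is an ID either way because ID matches more characters than FOR.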

As a result, grammars very often contain a parser rule such as ambigous that lists all the keywords, so that input which merely contains a keyword is still accepted.

For example:

alfaNum: (ALFA | NUM | ambigous | '_' )+?;
ambigous: SELECT | WHERE | FROM | WITH | SET | AS;

This way, input such as "selection" is accepted by alfaNum. If ambigous were not included, it would fail because of the lexer rule SELECT: 'select'; which matches the leading "select".
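To see why "selection" needs the ambigous alternative, here is a minimal Python sketch of the same idea. It is a toy model, not ANTLR itself, and it rests on assumptions the answer does not state: that ALFA matches a single letter, NUM a single digit, and that '_' gets its own token (called UNDERSCORE here). The helper names tokenize and is_alfa_num are hypothetical.

```python
import re


def tokenize(text, rules):
    """Toy lexer: longest match wins; ties go to the rule listed first."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


# Assumed lexer rules: single-character ALFA and NUM, plus one keyword.
rules = [
    ("SELECT", r"select"),
    ("ALFA", r"[a-zA-Z]"),
    ("NUM", r"[0-9]"),
    ("UNDERSCORE", r"_"),
]

# Token types listed by the ambigous parser rule in the answer.
AMBIGOUS = {"SELECT", "WHERE", "FROM", "WITH", "SET", "AS"}


def is_alfa_num(kinds, allow_keywords):
    """Model of alfaNum: one or more allowed token types."""
    allowed = {"ALFA", "NUM", "UNDERSCORE"}
    if allow_keywords:
        allowed |= AMBIGOUS
    return len(kinds) > 0 and all(k in allowed for k in kinds)


kinds = [name for name, _ in tokenize("selection", rules)]
print(kinds)                      # ['SELECT', 'ALFA', 'ALFA', 'ALFA']
print(is_alfa_num(kinds, True))   # True
print(is_alfa_num(kinds, False))  # False
```

Under these assumptions the lexer splits "selection" into a SELECT token followed by three ALFA tokens, so the parser rule only accepts it when keyword tokens are allowed. If ALFA were instead defined to match a run of letters, the whole word would be one longer match and the workaround would not be needed for this input.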
