From the ANTLR book (section 5.5):
Matching Identifiers
In grammar pseudocode, a basic identifier is a nonempty sequence of
uppercase and lowercase letters. Using our newfound skills, we know
to express the sequence pattern using notation (...)+. Because the
elements of the sequence can be either uppercase or lowercase letters,
we also know that we’ll have a choice operator inside the subrule.
ID : ('a'..'z'|'A'..'Z')+ ; // match 1-or-more upper or lowercase letters
The only new ANTLR notation here is the range operator: 'a'..'z'
means any character from a to z. That is literally the ASCII code
range from 97 to 122. To use Unicode code points, we need to use
'\uXXXX' literals where XXXX is the hexadecimal value for the
Unicode character code point value.
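To make the range semantics concrete, here is a minimal Python sketch (not ANTLR itself) that checks characters against inclusive code point ranges the way the rule above describes. The helper names in_range and is_id are hypothetical and exist only for this illustration.

```python
def in_range(ch: str, lo: str, hi: str) -> bool:
    """Return True if ch lies in the inclusive character range lo..hi."""
    return ord(lo) <= ord(ch) <= ord(hi)


def is_id(text: str) -> bool:
    """Mimic ('a'..'z'|'A'..'Z')+ : one or more ASCII letters."""
    return len(text) > 0 and all(
        in_range(c, 'a', 'z') or in_range(c, 'A', 'Z') for c in text
    )


# 'a'..'z' spans code points 97 to 122.
print(ord('a'), ord('z'))
# Python also has a \uXXXX escape; '\u0061' is the same character as 'a'.
print('\u0061' == 'a')
print(is_id("Hello"), is_id("x1"), is_id(""))
```

The empty string is rejected because the + operator requires at least one element, and "x1" is rejected because the digit falls outside both ranges.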
As a shorthand for character sets, ANTLR supports the more familiar
regular expression set notation.
ID : [a-zA-Z]+ ; // match 1-or-more upper or lowercase letters
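The set notation is close to what regular expression engines accept. As a rough check in Python (using the standard re module rather than ANTLR), the set form and a choice between two ranges give the same verdict on every sample string. The names ID_SET, ID_ALT, and same_verdict are made up for this sketch.

```python
import re

ID_SET = re.compile(r'[a-zA-Z]+')          # set notation, like [a-zA-Z]+
ID_ALT = re.compile(r'(?:[a-z]|[A-Z])+')   # choice between two ranges


def same_verdict(text: str) -> bool:
    """True if both patterns agree on whether the whole string is an ID."""
    return bool(ID_SET.fullmatch(text)) == bool(ID_ALT.fullmatch(text))


samples = ["enum", "Foo", "x1", "", "for", "a_b"]
print(all(same_verdict(s) for s in samples))
```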
Rules such as ID sometimes conflict with other lexical rules or
literals referenced in the grammar such as 'enum'.
grammar KeywordTest;
enumDef : 'enum' '{' ... '}' ;
...
FOR : 'for' ;
...
ID : [a-zA-Z]+ ; // does NOT match 'enum' or 'for'
Rule ID could also match keywords such as enum and for, which means
there’s more than one rule that could match the same string. To make
this clearer, consider how ANTLR handles combined lexer/parser grammars
such as this. ANTLR collects and separates all of the string literals
and lexer rules from the parser rules. Literals such as 'enum' become
lexical rules and go immediately after the parser rules but before the
explicit lexical rules.
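The separation step can be pictured with a small Python sketch. It is only a model of the ordering described in the paragraph, not ANTLR's actual implementation. The function effective_lexer_rules is hypothetical, and the enumDef body is simplified to just its literals because the original elides the middle with "...".

```python
import re


def effective_lexer_rules(parser_rules, lexer_rules):
    """Return lexer rule names in priority order: literals pulled out of
    the parser rules first, then the explicit lexer rules as written."""
    implicit = []
    for body in parser_rules.values():
        # Collect quoted literals such as 'enum' in order of appearance.
        for literal in re.findall(r"'[^']*'", body):
            if literal not in implicit:
                implicit.append(literal)
    return implicit + list(lexer_rules)


# Simplified stand-in for KeywordTest (assumption: elided parts dropped).
parser_rules = {"enumDef": "'enum' '{' '}'"}
lexer_rules = {"FOR": "'for'", "ID": "[a-zA-Z]+"}

print(effective_lexer_rules(parser_rules, lexer_rules))
```

In this model the literal 'enum' lands ahead of both FOR and ID, which matches the ordering the text describes.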
ANTLR lexers resolve ambiguities between lexical rules by favoring the
rule specified first. That means your ID rule should be defined after
all of your keyword rules, like it is here relative to FOR. ANTLR puts
the implicitly generated lexical rules for literals before explicit
lexer rules, so those always have priority. In this case, 'enum' is
given priority over ID automatically. Because ANTLR reorders the
lexical rules to occur after the parser rules, the following variation
on KeywordTest results in the same parser and lexer:
grammar KeywordTestReordered;
FOR : 'for' ;
ID : [a-zA-Z]+ ; // does NOT match 'enum' or 'for'
...
enumDef : 'enum' '{' ... '}' ;
...
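To see why the comment says ID does not match 'enum' or 'for', the tie-breaking can be simulated in Python. This is a toy tokenizer, not ANTLR's lexer: it takes the longest match at each position and, when two rules match text of the same length, keeps the rule listed first. The RULES table, the tokenize function, and the whitespace skipping are assumptions added for the sketch (the grammar shown above has no whitespace rule).

```python
import re

# (name, regex) pairs in priority order: implicit literal rules first,
# then the explicit lexer rules in the order they appear in the grammar.
RULES = [
    ("'enum'", r"enum"),
    ("'{'", r"\{"),
    ("'}'", r"\}"),
    ("FOR", r"for"),
    ("ID", r"[a-zA-Z]+"),
]


def tokenize(text, rules=RULES):
    """Split text into (rule name, matched text) pairs."""
    pos, tokens = 0, []
    while pos < len(text):
        if text[pos].isspace():      # assumption: whitespace is skipped
            pos += 1
            continue
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # A strictly longer match wins; on a tie the earlier rule stays.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


print(tokenize("enum for format x"))
# Listing ID before FOR makes ID win the tie on "for".
print(tokenize("for", [("ID", r"[a-zA-Z]+"), ("FOR", r"for")]))
```

Note that "format" still comes out as an ID even though FOR is listed earlier, because ID matches more characters. Rule order only decides between rules that match the same text.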