ANTLR (Lexer): separate arbitrary identifiers from keywords

Question

I'm trying to create a (simple) lexer for bat/cmd files (for syntax coloring). As part of this task, I need to separate keywords from (arbitrary) identifiers. But according to this answer, ANTLR tries to let the longest match win over shorter ones. My grammar looks like this so far:

lexer grammar CmdLexer;

Identifier
    : IdentifierNonDigit
      (  IdentifierNonDigit
      |  Digit
      )+
    ;

Number
    : Digit+
    ;

fragment IdentifierNonDigit
    : [a-zA-Z_\u0080-\uffff]
    ;

fragment Digit
    : [0-9]
    ;

Punctuation
    : [\u0021-\u002f\u003a-\u0040\u005b-\u0060\u007b-\u007f]+
    ;

Keyword
    : A P P E N D
    | A T
    | A T T R I B
    | B R E A K
    | C A L L
    | C D
    | C H C P
    | C H D I R
    | C L S
    | C O L O R
    | C O P Y
    | D A T E
    | D E L
    | D I R
    | D O
    | E C H O
    | E D I T
    | E N D L O C A L
    | E Q U
    | E X I S T
    | E X I T
    | F C
    | F O R
    | F T Y P E
    | G O T O
    | G E Q
    | G T R
    | I F
    | I N
    | L E Q
    | L S S
    | M D
    | M K D I R
    | M K L I N K
    | M O R E
    | M O V E
    | N E Q
    | N O T
    | N U L
    | P A T H
    | P A U S E
    | P O P D
    | P U S H D
    | R D
    | R E N
    | R E N A M E
    | S E T
    | S E T L O C A L
    | S H I F T
    | S T A R T
    | T I T L E
    | T R E E
    | T Y P E
    | W H E R E
    | W H O A M I
    | X C O P Y
    ;

fragment A:('a'|'A');
fragment B:('b'|'B');
fragment C:('c'|'C');
fragment D:('d'|'D');
fragment E:('e'|'E');
fragment F:('f'|'F');
fragment G:('g'|'G');
fragment H:('h'|'H');
fragment I:('i'|'I');
fragment J:('j'|'J');
fragment K:('k'|'K');
fragment L:('l'|'L');
fragment M:('m'|'M');
fragment N:('n'|'N');
fragment O:('o'|'O');
fragment P:('p'|'P');
fragment Q:('q'|'Q');
fragment R:('r'|'R');
fragment S:('s'|'S');
fragment T:('t'|'T');
fragment U:('u'|'U');
fragment V:('v'|'V');
fragment W:('w'|'W');
fragment X:('x'|'X');
fragment Y:('y'|'Y');
fragment Z:('z'|'Z');

Whitespace
    : [ \t]+
      -> skip
    ;

Newline
    : ( '\r' '\n'?
      | '\n'
      )
      -> skip
    ;

LineComment
    : ( '@'? R E M ~[\r\n]*
      | '@'? '::' ~[\r\n]*
      )
      -> skip
    ;

But it always matches everything as Identifier, even words like append or CALL. I don't see how modes would solve this problem here. How can I give a certain rule (here Keyword) higher priority over another (here Identifier)?

sepp2k · Accepted Answer · 2020-04-27T12:48:18

"But according to this answer ANTLR tries to let the longest match win over shorter ones."

It does, and that should be what you want. Note that this rule (the so-called maximal munch rule) has nothing to do with whether append is matched as a keyword or an identifier. It has to do with whether appendix is matched as the keyword append followed by the identifier ix, or as the single identifier appendix. Since the latter is clearly what one wants in most contexts, the maximal munch rule is useful.

What matters in this case, though, is what happens if multiple rules produce a match of the same length. In that case ANTLR applies the rule that is defined first in the grammar. So if you change the order of your definitions so that Keyword comes before Identifier, the Keyword rule will take precedence whenever both rules produce a match of the same length (and the longest match still wins whenever the lengths differ). So an input like append appendix would be tokenized as the keyword append, followed by the identifier appendix, which should be what you want.
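
As a minimal sketch of that reordering (not your full grammar): the grammar name CmdLexerSketch is made up for illustration, only two keywords are listed, and the character class is reduced to ASCII. It also uses * instead of + in Identifier so that single-character identifiers match; that is my own change, since the rule in the question requires at least two characters.

```antlr
lexer grammar CmdLexerSketch;   // hypothetical name; the file would be CmdLexerSketch.g4

// Keyword is defined BEFORE Identifier, so it wins whenever both rules
// match the same number of characters (e.g. "append" or "CALL").
Keyword
    : A P P E N D
    | C A L L
    ;

// Identifier still wins when it matches MORE characters (maximal munch),
// e.g. "appendix" is one Identifier, not Keyword "append" + "ix".
Identifier
    : IdentifierNonDigit (IdentifierNonDigit | Digit)*
    ;

fragment IdentifierNonDigit : [a-zA-Z_] ;
fragment Digit              : [0-9] ;

// Case-insensitive letter fragments, only those needed by the keywords above.
fragment A : [aA] ;
fragment C : [cC] ;
fragment D : [dD] ;
fragment E : [eE] ;
fragment L : [lL] ;
fragment N : [nN] ;
fragment P : [pP] ;

Whitespace : [ \t\r\n]+ -> skip ;
```

With this ordering, the input append appendix produces a Keyword token followed by an Identifier token. Fragment rules never produce tokens themselves, so their position in the file does not affect the tie-break.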

PS: I'm not sure where/how your lexer is going to be used, but generally you'd want to distinguish between different keywords instead of having one rule that matches all keywords. If the tokens are going to be used as input to a parser, the information that something is a keyword is not very useful without knowing which keyword it is.
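
A rough sketch of what one token type per keyword could look like. The grammar name and the token names IF, FOR and GOTO are my own choices for illustration, and only three keywords are shown:

```antlr
lexer grammar CmdKeywordsSketch;   // hypothetical name

// One token type per keyword, all defined before Identifier so that
// they win ties of equal length.
IF   : I F ;
FOR  : F O R ;
GOTO : G O T O ;

// Anything longer, such as "iffy" or "format", is still an Identifier
// because the longest match wins.
Identifier : [a-zA-Z_] [a-zA-Z_0-9]* ;

// Case-insensitive letter fragments used by the keyword rules.
fragment F : [fF] ;
fragment G : [gG] ;
fragment I : [iI] ;
fragment O : [oO] ;
fragment R : [rR] ;
fragment T : [tT] ;

Whitespace : [ \t\r\n]+ -> skip ;
```

A parser grammar could then refer to IF, FOR and GOTO individually. For pure syntax coloring, where every keyword gets the same color, the single Keyword rule may well be enough.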
