Antlr4 common tokens with modes

Question

Grammar before moving tokens to a common file

lexer grammar ALexer;

COMMAND_START
    : [a-zA-Z]                          -> pushMode(COMMAND_MODE)
    ;

EQUALS
    : '='                               -> pushMode(VALUE_MODE)
    ;

mode COMMAND_MODE;

COMMAND_NAME_REMAINDER
    : ([a-zA-Z0-9_ ]? [a-zA-Z0-9])*     -> popMode
    ;

mode VALUE_MODE; 

IDENTIFIER
    : A_Z ((UNDERSCORE | A_Z | DIGIT | WS)*? (UNDERSCORE | A_Z | DIGIT))* -> popMode
    ;

Grammar after moving tokens to a common file

The common lexer is imported by 3 other lexers. It contains the shared IDENTIFIER token.

lexer grammar CommonLexer;

..
..
IDENTIFIER
    : A_Z ((UNDERSCORE | A_Z | DIGIT | WS)*? (UNDERSCORE | A_Z | DIGIT))*
    ;

The following lexer imports the Common lexer and has a few modes

lexer grammar ALexer;

import CommonLexer;

COMMAND_START
    : [a-zA-Z]                          -> pushMode(COMMAND_MODE)
    ;

EQUALS
    : '='                               -> pushMode(VALUE_MODE)
    ;


mode COMMAND_MODE;

COMMAND_NAME_REMAINDER
    : ([a-zA-Z0-9_ ]? [a-zA-Z0-9])*     -> popMode
    ;

mode VALUE_MODE; 

IDENTIFIER_VALUE_MODE
    : IDENTIFIER                            -> type(IDENTIFIER), popMode
    ;

Parser grammar:

parser grammar AParser;

options { tokenVocab=ALexer; }

genericCommand
    : COMMAND_START COMMAND_NAME_REMAINDER? (COLON parameterArray)?
    ;

Result: A command such as "Delete Resources: a;", which was previously recognized as COMMAND_START, is now recognized as IDENTIFIER.

Question: How can I fix this? IDENTIFIER should remain in the CommonLexer.

Please let me know if you need more details, thanks.

Mike Cargal · Accepted Answer · 2021-03-11T03:34:28

I can't tell for sure (the Common Lexer extract is mostly ellipses), but in the original lexer grammar IDENTIFIER could only be matched after you had been pushed into VALUE_MODE. It appears that you lost that characteristic when you created the Common Lexer. Since IDENTIFIER is now "in the open" in the Common Lexer, it will match whether or not you're in VALUE_MODE (and its length will make it a stronger match). That explains the different behavior.

Your IDENTIFIER lexer rule matches a longer string of characters than COMMAND_START, so it takes precedence, and you never get a "hit" on the COMMAND_START rule to push you into COMMAND_MODE. This is the heart of your problem. IDENTIFIER overlaps COMMAND_START, and its match is always at least as long (1 character) as the COMMAND_START match. ANTLR picks the rule with the longest match, so IDENTIFIER wins whenever it can consume more than one character.
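One possible way to keep the pattern in the Common Lexer without exposing it in the default mode is to share it as a fragment. This is only a sketch, not something tested against your full grammar. The name IDENTIFIER_CHARS is hypothetical, and the definitions of A_Z, DIGIT, UNDERSCORE, and WS are assumptions, since the question does not show them.

```antlr
lexer grammar CommonLexer;

// Fragments never produce tokens on their own, so nothing here can
// compete with COMMAND_START in the default mode.
fragment IDENTIFIER_CHARS   // hypothetical name for the shared pattern
    : A_Z ((UNDERSCORE | A_Z | DIGIT | WS)*? (UNDERSCORE | A_Z | DIGIT))*
    ;

// Assumed definitions; the question does not show these.
fragment A_Z        : [a-zA-Z] ;
fragment DIGIT      : [0-9] ;
fragment UNDERSCORE : '_' ;
fragment WS         : [ \t] ;
```

```antlr
lexer grammar ALexer;

import CommonLexer;

// Declares the IDENTIFIER token type without giving it a default-mode rule.
tokens { IDENTIFIER }

COMMAND_START : [a-zA-Z] -> pushMode(COMMAND_MODE) ;
EQUALS        : '='      -> pushMode(VALUE_MODE) ;

mode COMMAND_MODE;
COMMAND_NAME_REMAINDER : ([a-zA-Z0-9_ ]? [a-zA-Z0-9])* -> popMode ;

mode VALUE_MODE;
// Only inside VALUE_MODE does the shared pattern become an IDENTIFIER token.
IDENTIFIER_VALUE_MODE : IDENTIFIER_CHARS -> type(IDENTIFIER), popMode ;
```

With this arrangement the default mode no longer has a rule that can outmatch COMMAND_START, which should restore the behavior of the original single-file grammar. Each lexer that needs the token would still have to declare it and decide in which mode it applies.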

Without the fragment definitions for A_Z, UNDERSCORE, DIGIT, and WS (you're using them like fragments, so I assume they are), it's hard to determine what you intend the difference between a COMMAND and an IDENTIFIER to be.

Having COMMAND_START push a mode only to pop it immediately is "unusual". I would expect to see a COMMAND lexer rule that incorporates the whole pattern:

COMMAND: [a-zA-Z] ([a-zA-Z0-9_ ]? [a-zA-Z0-9])* ;

This is where I can't really tell what should distinguish a COMMAND from an IDENTIFIER in your input stream. (Including WS within the tokens is a bit of an anti-pattern as well.)

Is this something where you have control over the language design, or something where you have to match an established definition?

If you have control, I suggest you read up a bit and reconsider your approach.

If it's already established, perhaps you can share the established definition and how it distinguishes between IDENTIFIERs and COMMANDs.

Reading between the lines here, it appears that whatever is before the colon is intended to be the command, and whatever is after the colon is where you're expecting IDENTIFIERs.

I think you're trying to put too much of the work into the lexer. Try rethinking genericCommand so that the parser, not the lexer, assembles the command from IDENTIFIER tokens:

genericCommand: IDENTIFIER+ (COLON parameterArray)?;

(I'd suggest dropping the WS from both the COMMAND and IDENTIFIER tokens if you can manage it. That tends to create all sorts of tokenization ambiguity issues.)
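Put together, the parser-centric approach might look roughly like the sketch below. It assumes you can drop whitespace from the tokens. The grammar name, the SEMI and COMMA tokens, and the body of parameterArray are placeholders, since the question doesn't show the real definitions.

```antlr
grammar CommandSketch;   // hypothetical grammar name

// "Delete Resources" arrives as two IDENTIFIER tokens; the parser joins them.
genericCommand
    : IDENTIFIER+ (COLON parameterArray)? SEMI? EOF
    ;

// Placeholder only: the real parameterArray rule is not shown in the question.
parameterArray
    : IDENTIFIER (COMMA IDENTIFIER)*
    ;

COLON      : ':' ;
SEMI       : ';' ;
COMMA      : ',' ;
IDENTIFIER : [a-zA-Z] [a-zA-Z0-9_]* ;   // no whitespace inside the token
WS         : [ \t\r\n]+ -> skip ;       // whitespace only separates tokens
```

For an input like "Delete Resources: a;" the lexer would produce IDENTIFIER IDENTIFIER COLON IDENTIFIER SEMI, and no modes are needed because the colon, not the lexer state, marks where the command name ends.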
