I've been using lexer modes in ANTLR and I've run into a problem with the 'more' command: it doesn't seem to match everything that should belong to the resulting token. To make this clearer, here is roughly what my grammar looks like:

// DEFAULT_MODE
fragment A: 'A'; // same done for A-Z
KEYWORD_CLASS: C L A S S;
NUM: [0-9];
KEYWORD_SMTH: S M T H -> mode(NUMBER_MODE);

mode NUMBER_MODE;

NUMBER: NUM+ -> mode(ANOTHER_MODE);
NO_NUMBER: ~[0-9] -> more, mode(DEFAULT_MODE);

Now when I try to test the parser rule

rule: KEYWORD_SMTH NUMBER? KEYWORD_CLASS;

then I'm expecting to match the following phrase:

SMTH CLASS

But for some reason the leading C is not matched as part of the keyword token. I have to type something like

SMTH gCLASS

in order to match the keyword CLASS. If I understand correctly, the 'more' command should match anything that is not a number and switch back to the default mode, so that the matched text can become part of another token. Can someone please tell me where my mistake is? Thanks.

1 Answer

Assuming you omitted the rule that skips/hides spaces, this is what happens when tokenising SMTH CLASS:

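  1. The token KEYWORD_SMTH is created for the text "SMTH".
  2. The mode changes from DEFAULT_MODE to NUMBER_MODE.
  3. NO_NUMBER matches the "C". Because of 'more', no token is emitted yet; the "C" is kept as the beginning of the next token.
  4. The mode changes from NUMBER_MODE back to DEFAULT_MODE.
  5. In DEFAULT_MODE, the lexer continues matching at the "L", and the previously matched "C" is glued to the front of whatever "LASS" is tokenised as. Note that this will NOT be KEYWORD_CLASS: that rule would have to match "CLASS" starting at the "C", which has already been consumed.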

So, assuming that "LASS" is tokenised as an IDENTIFIER token or similar, you end up with two tokens:

  1. KEYWORD_SMTH (text "SMTH")
  2. IDENTIFIER (text "C" + "LASS")
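
One possible way around this, sketched under a few assumptions, is to avoid consuming a character with 'more' at all and instead match the whole keyword inside NUMBER_MODE, relabelling it with the type command before switching back. Only the NUMBER_MODE section of the grammar changes. The rule names NM_KEYWORD_CLASS and NM_WS are made up for this illustration, and the whitespace rule is an assumption, since the question does not show one:

```antlr
mode NUMBER_MODE;

// A number follows the keyword: emit it and move on, as in the question.
NUMBER: NUM+ -> mode(ANOTHER_MODE);

// No number follows: match the complete keyword here, emit it with the
// token type KEYWORD_CLASS, and return to the default mode.
// (NM_KEYWORD_CLASS is a hypothetical helper name.)
NM_KEYWORD_CLASS: C L A S S -> type(KEYWORD_CLASS), mode(DEFAULT_MODE);

// Assumed whitespace handling inside this mode (not shown in the question).
NM_WS: [ \t\r\n]+ -> skip;
```

With this variant the parser should see a KEYWORD_CLASS token with the text "CLASS", because the rule matches the keyword from its first letter. The trade-off is that every token that may legally follow KEYWORD_SMTH needs its own rule in NUMBER_MODE, so this only stays manageable when that set is small.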