Lexer, overlapping rule, but want the shorter match

Question

I want to read an input stream and divide the input into 2 types: PATTERN & WORD_WEIGHT, which are defined below.

The problem arises from the fact that all the chars defined for a WORD_WEIGHT are also valid for a PATTERN. When there are multiple WORD_WEIGHTs with no spaces between them, the lexer matches a single PATTERN rather than delivering multiple WORD_WEIGHTs.

I need to be able to handle the following cases and get the indicated result:

[20] => WORD_WEIGHT
cat => PATTERN
[dog] => PATTERN

And this case, which is the problem. It matches PATTERN because the lexer selects the longer of the 2 possible matches. Note: there's no space between the two weights.

[20][30] => WORD_WEIGHT WORD_WEIGHT
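
To make the longest-match behaviour concrete, here is a minimal Java sketch that imitates it with regular expressions instead of ANTLR. The class name LongestMatchDemo and its methods are hypothetical, and the two regexes are hand translations of the WORD_WEIGHT and PATTERN rules from the grammar below. It only models the rule "longest match wins, ties go to the rule defined first"; it is not how ANTLR is implemented.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical demo: imitates "longest match wins" for the two lexer rules. */
final class LongestMatchDemo {
    // Hand translation of: '[' WORD_WEIGHT_VALUE ']'
    private static final Pattern WORD_WEIGHT =
        Pattern.compile("\\[[-+]?(?:\\d+(?:\\.\\d+)?|\\.\\d+)\\]");
    // Hand translation of: ~(' ' | '\t' | '\r' | '\n')+
    private static final Pattern PATTERN = Pattern.compile("[^ \\t\\r\\n]+");

    /** Length of the match anchored at the start of the input, or 0 if none. */
    private static int matchLength(Pattern rule, String input) {
        Matcher m = rule.matcher(input);
        return m.lookingAt() ? m.end() : 0;
    }

    /** Names the rule that would produce the first token of the input. */
    static String firstTokenType(String input) {
        int weight = matchLength(WORD_WEIGHT, input);
        int pattern = matchLength(PATTERN, input);
        if (weight == 0 && pattern == 0) {
            return "NONE";
        }
        // The longer match wins; on a tie the rule defined first (WORD_WEIGHT) wins.
        return pattern > weight ? "PATTERN" : "WORD_WEIGHT";
    }
}
```

For [20] both rules match four characters, so the tie goes to WORD_WEIGHT. For [20][30] PATTERN matches all eight characters and wins, which is exactly the unwanted result.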

I also need to handle this case (which imposes some limits on the possible solutions). Note that the brackets in a PATTERN need not be balanced:

[[[cat] => PATTERN

Here's the grammar:

grammar Brackets;

fragment
DIGIT
    : ('0'..'9')
    ;

fragment
WORD_WEIGHT_VALUE           
    : ('-' | '+')? DIGIT+ ('.' DIGIT+)? 
    | ('-' | '+')? '.' DIGIT+
    ;

WORD_WEIGHT 
    : '[' WORD_WEIGHT_VALUE ']' 
    ;

PATTERN   
    : ~(' ' | '\t' | '\r' | '\n' )+  
    ;

WS 
    : (' ' | '\t' | '\r' | '\n' )+ -> skip
    ;


start : (PATTERN | WORD_WEIGHT)* EOF;

The question is, what Lexer rules would give the desired result?

I'm wishing for a feature, a special directive that one can specify for a lexer rule that affects the matching process. It would instruct the lexer, upon a match of the rule, to stop the matching process and use this matched token.

FOLLOW-UP - THE SOLUTION WE CHOSE TO PURSUE:

Replace WORD_WEIGHT above with:

fragment
WORD_WEIGHT 
    : '[' WORD_WEIGHT_VALUE ']'
    ;

WORD_WEIGHTS
    : WORD_WEIGHT (INNER_WS? WORD_WEIGHT)*
    ;

fragment
INNER_WS
    : (' ' | '\t' )+
    ;

Also, the parser rule becomes:

start : (PATTERN | WORD_WEIGHTS)* EOF;

Now any sequence of word weights (space separated or not) becomes the value of a single WORD_WEIGHTS token. This happens to match our usage too: our grammar (not in the snippet above) always defines word weights as "one or more". The multiplicity is now "captured" by the lexer instead of the parser. If/when we need to process each word weight separately, we can split the token's value in the application (parse tree listener).
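
A minimal Java sketch of that splitting step, assuming the listener hands over the text of one WORD_WEIGHTS token as a plain string. The class WordWeightSplitter and its split method are hypothetical names, and the regex is a hand translation of WORD_WEIGHT_VALUE, so treat it as an illustration rather than our production code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: splits the text of one WORD_WEIGHTS token into its values. */
final class WordWeightSplitter {
    // One bracketed weight; the capture group mirrors WORD_WEIGHT_VALUE.
    private static final Pattern WEIGHT =
        Pattern.compile("\\[([-+]?(?:\\d+(?:\\.\\d+)?|\\.\\d+))\\]");

    /** Returns the numeric weights in order, skipping any inner whitespace. */
    static List<Double> split(String tokenText) {
        List<Double> weights = new ArrayList<>();
        Matcher m = WEIGHT.matcher(tokenText);
        while (m.find()) {
            weights.add(Double.parseDouble(m.group(1)));
        }
        return weights;
    }
}
```

In a parse tree listener the argument would be the text of the WORD_WEIGHTS terminal. Because the lexer has already validated the token, the helper does not need to report malformed input.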

Sam Harwell · Accepted Answer · 2014-05-22T18:57:43

You can implement WORD_WEIGHT as follows:

WORD_WEIGHT
  : '[' WORD_WEIGHT_VALUE ']'
    PATTERN?
  ;

Then, in your lexer, you can override the emit method to correct the position of the lexer to remove the PATTERN (if any) which was added to the end of the WORD_WEIGHT token. You can see examples of this in ANTLRWorks 2:

The LBRACE token in StringTemplate 4 is modified by this code.
The DELIMITERS token in StringTemplate 4 is modified by this code.

The modification requires the following steps.

1. Override LexerATNSimulator to add the resetAcceptPosition method.
2. Set the _interp field to an instance of your custom LexerATNSimulator in the constructor for your lexer class.
3. Calculate the desired end position for your token, and call resetAcceptPosition. For fixed-width tokens like you see in the ST4 examples, the calculation was simply the length of the fixed operator or keyword which appeared at the beginning of the token. For your case, you will need to call getText() and examine the result to determine the correct length of your WORD_WEIGHT token. Since the WORD_WEIGHT_VALUE rule cannot match ], the easiest analysis would probably be to find the index of the first ] character in the result of getText() (the syntax of WORD_WEIGHT ensures the character will always exist).
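
The length calculation in the last step can be sketched in plain Java as follows. The class WordWeightBoundary and its weightLength method are hypothetical names introduced for illustration. The sketch deliberately leaves out the ANTLR side (the emit override and the call to resetAcceptPosition), since those depend on the custom simulator you write.

```java
/** Hypothetical helper for the emit() override: finds where WORD_WEIGHT really ends. */
final class WordWeightBoundary {
    /**
     * @param matchedText text matched by '[' WORD_WEIGHT_VALUE ']' PATTERN?
     * @return number of leading characters that belong to the WORD_WEIGHT itself
     */
    static int weightLength(String matchedText) {
        // WORD_WEIGHT_VALUE cannot contain ']', so the first one closes the weight.
        int close = matchedText.indexOf(']');
        if (close < 0) {
            throw new IllegalArgumentException("not a WORD_WEIGHT match: " + matchedText);
        }
        return close + 1; // include the closing bracket
    }
}
```

Inside emit, the argument would be the result of getText(), and the token's start index plus the returned length gives the position at which the lexer should resume, so that the trailing PATTERN text is lexed again as its own token or tokens.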
