I want to read an input stream and split it into two token types, PATTERN and WORD_WEIGHT, both defined below.
The problem is that every character allowed in a WORD_WEIGHT is also valid in a PATTERN. When several WORD_WEIGHTs appear with no whitespace between them, the lexer matches a single PATTERN instead of delivering multiple WORD_WEIGHT tokens.
I need to be able to handle the following cases and get the indicated result:
- [20] => WORD_WEIGHT
- cat => PATTERN
- [dog] => PATTERN
This next case is the problem. There is no space between the two weights, so the input is lexed as one PATTERN: the lexer picks the rule that matches the longest run of characters.
- [20][30] => WORD_WEIGHT WORD_WEIGHT
I also need to handle this case, which limits the possible solutions. Note that the brackets in a PATTERN need not be balanced:
- [[[cat] => PATTERN
Here's the grammar:
grammar Brackets;
fragment
DIGIT
: ('0'..'9')
;
fragment
WORD_WEIGHT_VALUE
: ('-' | '+')? DIGIT+ ('.' DIGIT+)?
| ('-' | '+')? '.' DIGIT+
;
WORD_WEIGHT
: '[' WORD_WEIGHT_VALUE ']'
;
PATTERN
: ~(' ' | '\t' | '\r' | '\n' )+
;
WS
: (' ' | '\t' | '\r' | '\n' )+ -> skip
;
start : (PATTERN | WORD_WEIGHT)* EOF;
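To make the behaviour concrete, here is a rough Python emulation of how the lexer chooses between the rules above. It is only a sketch: the regular expressions are my hand translation of the lexer rules, and `tokenize` is a made-up helper, not part of ANTLR. It applies the two selection rules ANTLR uses: the longest match wins, and on a tie the rule listed first wins.

```python
import re

# Hand-translated regex equivalents of the lexer rules (an assumption, for illustration only).
WORD_WEIGHT_VALUE = r"[-+]?(?:[0-9]+(?:\.[0-9]+)?|\.[0-9]+)"
RULES = [
    ("WORD_WEIGHT", re.compile(r"\[" + WORD_WEIGHT_VALUE + r"\]")),
    ("PATTERN", re.compile(r"[^ \t\r\n]+")),
    ("WS", re.compile(r"[ \t\r\n]+")),
]

def tokenize(text):
    """Return token type names; longest match wins, ties go to the earlier rule."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in RULES:
            m = regex.match(text, pos)
            # Strict '>' keeps the earlier rule when two rules match the same length.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":  # WS is skipped, as in the grammar
            tokens.append(best_name)
        pos += best_len
    return tokens

# tokenize("[20]")      -> ["WORD_WEIGHT"]   (tie with PATTERN, earlier rule wins)
# tokenize("[20][30]")  -> ["PATTERN"]       (PATTERN matches 8 chars, WORD_WEIGHT only 4)
# tokenize("[20] [30]") -> ["WORD_WEIGHT", "WORD_WEIGHT"]
```

The second example is exactly the failing case: PATTERN consumes all eight characters, so the shorter WORD_WEIGHT match never gets a chance.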
The question is: what lexer rules would give the desired result?
What I'm wishing for is a special directive on a lexer rule that changes the matching process: as soon as that rule matches, the lexer would stop looking for longer matches and emit that token.
FOLLOW-UP - THE SOLUTION WE CHOSE TO PURSUE:
Replace WORD_WEIGHT above with:
fragment
WORD_WEIGHT
: '[' WORD_WEIGHT_VALUE ']'
;
WORD_WEIGHTS
: WORD_WEIGHT (INNER_WS? WORD_WEIGHT)*
;
fragment
INNER_WS
: (' ' | '\t' )+
;
The parser rule then becomes:
start : (PATTERN | WORD_WEIGHTS)* EOF;
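As a quick sanity check of the revised rules, here is a small Python emulation. Again, this is only a sketch under assumptions: the regexes are my hand translation of the rules, and `tokenize_revised` is a hypothetical helper, not an ANTLR API. It uses the same selection logic as ANTLR (longest match, then first rule on a tie).

```python
import re

# Hand-translated regex equivalents of the revised lexer rules (assumption, illustration only).
VALUE = r"[-+]?(?:[0-9]+(?:\.[0-9]+)?|\.[0-9]+)"
WEIGHT = r"\[" + VALUE + r"\]"
RULES = [
    # WORD_WEIGHT (INNER_WS? WORD_WEIGHT)* ; INNER_WS is spaces and tabs only
    ("WORD_WEIGHTS", re.compile(WEIGHT + r"(?:[ \t]*" + WEIGHT + r")*")),
    ("PATTERN", re.compile(r"[^ \t\r\n]+")),
    ("WS", re.compile(r"[ \t\r\n]+")),
]

def tokenize_revised(text):
    """Return token type names; longest match wins, ties go to the earlier rule."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in RULES:
            m = regex.match(text, pos)
            if m and len(m.group()) > best_len:  # strict '>' favors the earlier rule
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":
            tokens.append(best_name)
        pos += best_len
    return tokens

# tokenize_revised("[20][30]")    -> ["WORD_WEIGHTS"]  (same length as PATTERN, earlier rule wins)
# tokenize_revised("[20] [30]")   -> ["WORD_WEIGHTS"]  (9 chars beats PATTERN's 4)
# tokenize_revised("[20][30]cat") -> ["PATTERN"]       (PATTERN matches 11 chars)
```

Note that the first case only works because WORD_WEIGHTS is listed before PATTERN; with the order reversed, the tie would go to PATTERN.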
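Now any sequence of word weights, whether separated by spaces or tabs or not separated at all, becomes the text of a single WORD_WEIGHTS token. For input such as [20][30], WORD_WEIGHTS and PATTERN match the same number of characters, and ANTLR resolves that tie in favor of the rule that appears first in the grammar, so WORD_WEIGHTS must stay above PATTERN. This also happens to match our usage: our full grammar (not shown in the snippet above) always defines word weights as "one or more", so the multiplicity is now captured by the lexer instead of the parser. If we ever need to process each word weight separately, we can split the token text in the application (in a parse tree listener).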
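Splitting the token text in the application could look roughly like this. It is a minimal sketch: `split_word_weights` is a hypothetical helper name, the regex is my hand translation of WORD_WEIGHT_VALUE, and converting to `float` is my assumption about what the application wants. In a listener you would pass it the text of the WORD_WEIGHTS token.

```python
import re

# One bracketed weight; the capture group is the numeric value without the brackets.
# Hand-translated from WORD_WEIGHT_VALUE (assumption, illustration only).
WORD_WEIGHT = re.compile(r"\[([-+]?(?:[0-9]+(?:\.[0-9]+)?|\.[0-9]+))\]")

def split_word_weights(token_text):
    """Split the text of a WORD_WEIGHTS token into individual numeric weights."""
    # findall returns the captured value of each bracketed weight, in order;
    # the spaces and tabs between weights are simply skipped over.
    return [float(value) for value in WORD_WEIGHT.findall(token_text)]

# split_word_weights("[20][30]")    -> [20.0, 30.0]
# split_word_weights("[20] \t[-.5]") -> [20.0, -0.5]
```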