0
votes

A vastly simplified version of my lexer rules (part of a bigger grammar) looks something like the following:

fragment HEX_DIGIT : [0-9A-F] ;
fragment DIGIT : [0-9] ;
SCIENTIFIC : 'E' [+-] ;
INTEGER : DIGIT+ ;
HEX_INTEGER : HEX_DIGIT+ ;
FLOAT_ZERO : '0'* '.' '0'+ ;
FLOAT : DIGIT* '.' DIGIT+ ;

The problem arises with input such as 00E+00. The tokens I would like out of this are '00', 'E+', '00'. However, ANTLR goes the greedy route and matches '00E' as a HEX_INTEGER, so the full lexer then produces '+' and '00' tokens.
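The greedy behaviour follows from how an ANTLR lexer picks a token: the rule matching the longest prefix wins, and a tie goes to the rule listed first. The sketch below is not ANTLR. It is a rough Java model built on java.util.regex, with the rules above transcribed by hand and the fragments inlined; the class and method names (LongestMatchDemo, firstToken) are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Rough model of ANTLR's token choice: longest match wins, ties go to the
// rule listed first. This is NOT ANTLR, only an illustration.
class LongestMatchDemo {
    // Hand-transcribed lexer rules, in grammar order (fragments inlined).
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("SCIENTIFIC", Pattern.compile("E[+-]"));
        RULES.put("INTEGER", Pattern.compile("[0-9]+"));
        RULES.put("HEX_INTEGER", Pattern.compile("[0-9A-F]+"));
        RULES.put("FLOAT_ZERO", Pattern.compile("0*\\.0+"));
        RULES.put("FLOAT", Pattern.compile("[0-9]*\\.[0-9]+"));
    }

    // Returns "RULE:text" for the first token at the start of input, or null.
    static String firstToken(String input) {
        String best = null;
        int bestLen = 0;
        for (Map.Entry<String, Pattern> e : RULES.entrySet()) {
            Matcher m = e.getValue().matcher(input);
            // lookingAt() anchors at the start; strict '>' keeps the earlier rule on ties.
            if (m.lookingAt() && m.end() > bestLen) {
                bestLen = m.end();
                best = e.getKey() + ":" + m.group();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(firstToken("00E+00")); // HEX_INTEGER:00E
        System.out.println(firstToken("00"));     // INTEGER:00
    }
}
```

INTEGER only matches two characters of 00E+00 while HEX_INTEGER matches three, so HEX_INTEGER wins even though INTEGER is listed first.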

Any suggestions for handling this special case in the lexer? _input.LA() tricks don't seem to work: the lexer operates at the character level, so I can't tell in advance how far I have to look ahead to find the special 'E+' sequence at the end of the hex number.


2 Answers

3
votes

My recommendations are:

  1. Make SCIENTIFIC a fragment rule, and update your INTEGER rule to include support for scientific notation.

    INTEGER : DIGIT+ (SCIENTIFIC DIGIT+)?;
    
  2. Update your HEX_INTEGER rule so that it is not ambiguous with INTEGER. For example, 777 could be an INTEGER or a HEX_INTEGER, because not every hexadecimal number contains one of the digits A through F.

1
votes

Figured this out after some trial and error, and hope it helps anyone else looking to do something similar. It turns out you can use semantic predicates at more than just the start of your lexer rules, which I didn't realize.

// Tricky because of scientific notation: we can't consume all of "00E" in
// something like 00E+00, as we need the tokens '00', 'E+', '00'. If the number
// ends in 'E', don't let it be followed by '+' or '-'.
HEX_INTEGER
    : HEX_DIGIT*
      {_input.LA(1) != 'E' || (_input.LA(2) != '+' && _input.LA(2) != '-')}?
      HEX_DIGIT
    ;
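
For anyone who wants to check the intended behaviour outside ANTLR, here is a small hand-written Java model of the rule as the comment describes it: the final hex digit may not be an 'E' that is followed by '+' or '-'. It assumes the predicate is evaluated right before the last HEX_DIGIT and that the lexer keeps the longest match that passes. The names (HexIntegerModel, la, mayEndAt, matchLength) are hypothetical and nothing here is generated by ANTLR.

```java
// Hand-written model of the predicated HEX_INTEGER rule; NOT generated by ANTLR.
class HexIntegerModel {
    static boolean isHexDigit(int c) {
        return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'F');
    }

    // Mimics LA(k): the k-th character from pos (1-based), or -1 past the end.
    static int la(String s, int pos, int k) {
        int i = pos + k - 1;
        return i < s.length() ? s.charAt(i) : -1;
    }

    // The digit at pos may end the token unless it is 'E' followed by '+' or '-'.
    static boolean mayEndAt(String s, int pos) {
        return !(la(s, pos, 1) == 'E'
                && (la(s, pos, 2) == '+' || la(s, pos, 2) == '-'));
    }

    // Length of the longest HEX_INTEGER at the start of s, or 0 if none.
    static int matchLength(String s) {
        int run = 0;
        while (run < s.length() && isHexDigit(s.charAt(run))) run++;
        // Try the longest candidate first, backing off one digit at a time.
        for (int end = run; end >= 1; end--) {
            if (mayEndAt(s, end - 1)) return end;
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(matchLength("00E+00")); // 2, leaving "E+" and "00"
        System.out.println(matchLength("1E"));     // 2, a plain hex number ending in E
    }
}
```

With this condition, 00E+00 yields a two-character hex token, while an ordinary hex number such as 1E is still matched whole.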