Tokenizing scientific notation in Antlr4 Lexer

Question

A vastly simplified version of my lexer rules (within a bigger grammar) looks something like the following:

```
fragment HEX_DIGIT : [0-9A-F] ;
fragment DIGIT : [0-9] ;
SCIENTIFIC : 'E' [+-] ;
INTEGER : DIGIT+ ;
HEX_INTEGER : HEX_DIGIT+ ;
FLOAT_ZERO : '0'* '.' '0'+ ;
FLOAT : DIGIT* '.' DIGIT+ ;
```

The problem here comes with input such as 00E+00. The tokens I would like out of this are '00', 'E+', '00'. However, Antlr goes the greedy route and tokenizes '00E' as a HEX_INTEGER; the full lexer then produces '+' and '00' tokens.

Any suggestions for handling this special case in the lexer? _input.LA() tricks don't seem to work as we are operating at the character level, so I'm not always sure how far I have to look ahead to look for the special 'E+' sequence at the end of the hex number.

Sam Harwell · Accepted Answer · 2013-11-20T22:29:46

My recommendations are:

Make SCIENTIFIC a fragment rule, and update your INTEGER rule to include support for scientific notation.
```
INTEGER : DIGIT+ (SCIENTIFIC DIGIT+)?;
```
Update your HEX_INTEGER rule so that it is not ambiguous with INTEGER. For example, 777 could be either an INTEGER or a HEX_INTEGER, because not every hexadecimal number contains one of the digits A through F.
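
Putting both recommendations together, the rules might look something like the sketch below. The grammar name `Numbers` is hypothetical, and requiring at least one letter A-F in HEX_INTEGER is only one assumed way to remove the ambiguity; the recommendation above does not prescribe a particular fix, and a prefix such as `0x` would work as well.
```
lexer grammar Numbers;  // hypothetical name, for illustration only

fragment DIGIT      : [0-9] ;
fragment HEX_DIGIT  : [0-9A-F] ;
// Now a fragment, so 'E+' is never emitted as a token of its own.
fragment SCIENTIFIC : 'E' [+-] ;

// Optional exponent: 00E+00 is matched as a single INTEGER token.
INTEGER     : DIGIT+ (SCIENTIFIC DIGIT+)? ;

// Assumption: a hex literal must contain at least one letter A-F,
// so all-digit input such as 777 can only be an INTEGER.
HEX_INTEGER : DIGIT* [A-F] HEX_DIGIT* ;

FLOAT_ZERO  : '0'* '.' '0'+ ;
FLOAT       : DIGIT* '.' DIGIT+ ;
```
With these rules, the input 00E+00 gives INTEGER a six-character match while HEX_INTEGER can only match 00E. The lexer prefers the longest match, so INTEGER wins. Note that this yields one token rather than the three asked for in the question, and that input such as 00E or 1E5 (no sign after the E) is still a HEX_INTEGER.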
