Lexer rule conflict resolution needed

Question

I have these lexer rules in my ANTLR3 grammar:

INTEGER:                DIGITS;
FLOAT:                  DIGITS? DOT_SYMBOL DIGITS ('E' (MINUS_OPERATOR | PLUS_OPERATOR)? DIGITS)?;

HEXNUMBER:              '0X' HEXDIGIT+;
HEXSTRING:              'X' '\'' HEXDIGIT+ '\'';

BITNUMBER:              '0B' ('0' | '1')+;
BITSTRING:              'B' '\'' ('0' | '1')+ '\'';

NCHAR_TEXT:             'N' SINGLE_QUOTED_TEXT;

IDENTIFIER:             LETTER_WHEN_UNQUOTED+;

fragment LETTER_WHEN_UNQUOTED:
    '0'..'9'
    | 'A'..'Z' // Only upper case, as we use a case insensitive parser (insensitive only for ASCII).
    | '$'
    | '_'
    | '\u0080'..'\uffff'
;

and

qualified_identifier:
    IDENTIFIER ( options { greedy = true; }: DOT_SYMBOL IDENTIFIER)?
;

This works mostly fine except for very specific situations like the input t1.1_d which is supposed to be parsed as 2 identifiers connected with a dot. What happens is that .1 is matched as float even though it's followed by underscore and letter(s).

It's clear where that comes from: LETTER_WHEN_UNQUOTED includes digits so '1' can be both an integer and an identifier. But the rule order should take care to resolve this to an integer, as intented (and usually does).

However, I'm perplexed the t1.1_d input causes the float rule to kick in and would appreciate some pointers to resolve this problem. As soon as I add a space after the dot all is fine, but that is obviously not a real solution.

When I move the IDENTIFIER rule before the others I get new trouble because several other rules can no longer be matched then. Moving the FLOAT rule after the IDENTIFIER rule doesn't fix the problem either (but at least doesn't produce new problems). In this case we see the actual problem: the dot is always matched by the FLOAT rule if directly followed by a digit. What can I do to make it not match in my case?

Can you include a complete grammar (i.e., include all tokens/fragments referenced in your example)? Also, greedy defaults to true, so specifying it shouldn't be necessary. — Adrian Leonhard
You might consider doing some redesign: currently 1234 is both an INTEGER and an IDENTIFIER. I'm not saying it can't work, but it makes my head hurt. — Adrian Leonhard

Adrian Leonhard Adrian Leonhard · Accepted Answer · 2015-02-24T17:09:40

The problem is that the lexer operates independently of the parser. When faced with the input string t1.1_d, the lexer will first consume an IDENTIFIER, leaving .1_d. You now want it to match DOT_SYMBOL, followed by IDENTIFIER. However, the lexer will always match the longest possible token, resulting in FLOAT matching .1.

Moving IDENTIFIER before FLOAT doesn't help, because '.' isn't a valid IDENTIFIER symbol and so can't match the input at all when it starts with ..

Note that Java and co. don't allow identifiers to start with numbers, probably to avert these kinds of problems.

One possible solution would be to change the FLOAT rule to require digits before the dot: FLOAT: DIGITS '.' DIGITS ...

Lexer rule conflict resolution needed

1 Answers