Ambiguous Lexer rules in Antlr

Question

I have an ANTLR grammar in which several lexer rules match the same word. The lexer alone can't decide between them, but the parser rule makes the choice unambiguous.

Example:

conversion: NUMBER UNIT CONVERT UNIT;
NUMBER: [0-9]+;
UNIT: 'in' | 'meters' | ......;
CONVERT: 'in';

Input: 1 in in meters

The word "in" matches the lexer rules UNIT and CONVERT.

How can this be solved while keeping the grammar file readable?

BernardK · Accepted Answer · 2017-11-16T00:46:34

When an input matches two lexer rules, ANTLR picks the longest match; if both rules match the same length, the one defined first wins (see disambiguate). With your grammar, "in" will always be interpreted as UNIT, never CONVERT, and the rule

conversion: NUMBER UNIT CONVERT UNIT;

can't work because the lexer produces three UNIT tokens:

$ grun Question question -tokens -diagnostics input.txt 
[@0,0:0='1',<NUMBER>,1:0]
[@1,1:1=' ',<WS>,channel=1,1:1]
[@2,2:3='in',<UNIT>,1:2]
[@3,4:4=' ',<WS>,channel=1,1:4]
[@4,5:6='in',<UNIT>,1:5]
[@5,7:7=' ',<WS>,channel=1,1:7]
[@6,8:13='meters',<UNIT>,1:8]
[@7,14:14='\n',<NL>,1:14]
[@8,15:14='<EOF>',<EOF>,2:0]
Question last update 0159
line 1:5 missing 'in' at 'in'
line 1:8 mismatched input 'meters' expecting <EOF>
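
To make the tie-break concrete, here is a minimal Java sketch that mimics the choice with plain regular expressions. It is a toy model, not ANTLR's actual implementation: the class name TieBreakSketch and the method winningRule are made up for illustration, and the UNIT pattern contains only the two alternatives visible in the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model of "longest match wins, first rule wins on a tie".
// Hypothetical helper, not part of ANTLR.
final class TieBreakSketch {
    // Rules kept in definition order, mirroring the question's grammar.
    // The UNIT list is elided in the question; only 'in' and 'meters' are used here.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("NUMBER", Pattern.compile("[0-9]+"));
        RULES.put("UNIT", Pattern.compile("in|meters"));
        RULES.put("CONVERT", Pattern.compile("in"));
    }

    // Returns the name of the rule that wins at the start of input,
    // or null if no rule matches there.
    static String winningRule(String input) {
        String best = null;
        int bestLength = 0;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            // lookingAt() anchors the match at the start of the input.
            // The strict '>' keeps the earlier rule when lengths are equal.
            if (m.lookingAt() && m.end() > bestLength) {
                best = rule.getKey();
                bestLength = m.end();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(winningRule("in"));     // UNIT
        System.out.println(winningRule("meters")); // UNIT
        System.out.println(winningRule("1"));      // NUMBER
    }
}
```

Because UNIT and CONVERT both match exactly two characters of "in" and UNIT is defined first, CONVERT can never win in this model, which is consistent with the token dump above.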

What you can do is use a single ID (or TEXT) token type and distinguish the occurrences with labels, like this:

grammar Question;

question
@init {System.out.println("Question last update 0132");}
    :   conversion NL EOF
    ;

conversion
    :   NUMBER unit1=ID convert=ID unit2=ID
        {System.out.println("Quantity " + $NUMBER.text + " " + $unit1.text +
         " to convert " + $convert.text + " " + $unit2.text);}
    ;

ID      : LETTER ( LETTER | DIGIT | '_' )* ; // or TEXT : LETTER+ ;     
NUMBER  : DIGIT+ ;

NL      : [\r\n] ;
WS      : [ \t] -> channel(HIDDEN) ; // -> skip ;

fragment LETTER : [a-zA-Z] ;
fragment DIGIT  : [0-9] ;

Execution:

$ grun Question question -tokens -diagnostics input.txt 
[@0,0:0='1',<NUMBER>,1:0]
[@1,1:1=' ',<WS>,channel=1,1:1]
[@2,2:3='in',<ID>,1:2]
[@3,4:4=' ',<WS>,channel=1,1:4]
[@4,5:6='in',<ID>,1:5]
[@5,7:7=' ',<WS>,channel=1,1:7]
[@6,8:13='meters',<ID>,1:8]
[@7,14:14='\n',<NL>,1:14]
[@8,15:14='<EOF>',<EOF>,2:0]
Question last update 0132
Quantity 1 in to convert in meters

Labels are available from the rule's context in the visitor, so it is easy to distinguish tokens of the same type.
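
Since every word is now an ID, nothing in the grammar rejects input such as "1 in to meters", so a semantic check on the labelled tokens may be worthwhile. The following is a stand-alone sketch of such a check, written without the ANTLR runtime so it can be run on its own. The class ConversionCheck, the method describe and the unit list are assumptions for illustration; in a generated visitor the arguments would come from the labels, for example ctx.unit1.getText().

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Stand-alone sketch of a semantic check on the three labelled ID tokens.
// Hypothetical helper, not generated by ANTLR.
final class ConversionCheck {
    // Assumed unit list: the question elides most of it,
    // so only the two units it shows are included.
    private static final Set<String> KNOWN_UNITS =
            new HashSet<>(Arrays.asList("in", "meters"));

    // In a visitor for the 'conversion' rule, the arguments would be
    // ctx.NUMBER().getText(), ctx.unit1.getText(),
    // ctx.convert.getText() and ctx.unit2.getText().
    static String describe(String number, String unit1, String convert, String unit2) {
        // The middle word must be the keyword 'in'.
        if (!"in".equals(convert)) {
            throw new IllegalArgumentException("expected 'in' but found '" + convert + "'");
        }
        // The first and last words must be known units.
        for (String unit : new String[] {unit1, unit2}) {
            if (!KNOWN_UNITS.contains(unit)) {
                throw new IllegalArgumentException("unknown unit '" + unit + "'");
            }
        }
        return "Quantity " + number + " " + unit1 + " to convert " + convert + " " + unit2;
    }

    public static void main(String[] args) {
        // Same words as the input "1 in in meters".
        System.out.println(describe("1", "in", "in", "meters"));
        // prints: Quantity 1 in to convert in meters
    }
}
```

The position of each word is decided by the parser rule, and the check only confirms that the text found at each position is acceptable.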
