Unexpected behaviour when parsing a string with optional Suffix in antlr4

Question

I want to match multiple functions that accept a comma-separated list of placeholders followed by the definition of a unit, which is again separated by a comma from the rest of the arguments. The text to parse would look like example 1: "produkt([F1],[F2],EURO_CENT)" or example 2: "produkt([F1],[F2],EURO)".

The grammar, as I would expect it to work, is this:

[...]

term: [...]
    | 'produkt(' placeholder ',' placeholder ',' UNIT ')' #MultUnit
    [...]
    | placeholder #PlaceholderTwo
    ;

WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines

LBRACK: '[';
RBRACK: ']';
PLACE: TEXT+ NUMBER?;

placeholder: LBRACK PLACE+ RBRACK;

[..]

UNIT: TEXT (('_' TEXT)*)?;

TEXT: ('a' .. 'z' | 'A' .. 'Z')+;//[a-zA-Z]+;

[...]

With this grammar, example 1 works as expected, but example 2 gives me the error "line 1:18 mismatched input 'EURO' expecting UNIT". As I understand it, this means that "EURO" by itself does not match the pattern for UNIT but "EURO_CENT" does. I do not understand why this is the case, because the pattern for UNIT says that the "_CENT" part is optional and only the first part is mandatory.
I also tried giving the UNIT a prefix (in this case "Unit.") by changing the pattern to UNIT: 'Unit.' TEXT ('_' TEXT)*;
I changed the input string to "produkt([F1],[F2],Unit.EURO)" accordingly, and this matches like a charm.
However, the second approach is not very user-friendly, since we have to add something that is (in our opinion) unnecessary to the input. So the question is: why does the first option not match as expected when the UNIT string is a single word, and is there a workaround for it?

GRosenberg · Accepted Answer · 2015-11-30T21:18:18

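The short answer is that PLACE and UNIT are mutually ambiguous for content that only matches TEXT. The lexer prefers the longest match and, when two rules match the same number of characters, the rule that appears first in the grammar. "EURO" matches PLACE and UNIT equally well and PLACE is defined first, so the parser receives a PLACE token where it expects UNIT. "EURO_CENT" works because only UNIT can consume the underscore, which makes it the longer match. If the sample inputs are canonical, then change the PLACE rule to remove the ambiguity: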

PLACE : TEXT+ NUMBER ;
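
To see why the original rules misbehave and why requiring NUMBER helps, the lexer's choice can be imitated in a few lines of Python. This is only a rough sketch and not ANTLR itself. It assumes NUMBER is [0-9]+ (its definition is elided in the question), it invents the names PRODUKT_LPAREN, COMMA and RPAREN for the literals in the parser rule, and it mimics just the two selection rules: longest match first, then the rule listed first.

```python
import re

TEXT = r"[a-zA-Z]+"
NUMBER = r"[0-9]+"  # assumption: NUMBER is not shown in the question

# Punctuation tokens; the first three names are made up for this sketch.
PUNCTUATION = [
    ("PRODUKT_LPAREN", r"produkt\("),
    ("COMMA", r","),
    ("RPAREN", r"\)"),
    ("LBRACK", r"\["),
    ("RBRACK", r"\]"),
]

# Lexer rules in the order they appear in the question's grammar.
# TEXT+ accepts the same strings as TEXT, so it is written once here.
ORIGINAL_RULES = PUNCTUATION + [
    ("PLACE", TEXT + "(?:" + NUMBER + ")?"),
    ("UNIT", TEXT + "(?:_" + TEXT + ")*"),
    ("TEXT", TEXT),
]

# Same rules, but PLACE now requires the trailing NUMBER.
FIXED_RULES = PUNCTUATION + [
    ("PLACE", TEXT + NUMBER),
    ("UNIT", TEXT + "(?:_" + TEXT + ")*"),
    ("TEXT", TEXT),
]


def tokenize(text, rules):
    """Longest match wins; on a tie, the rule listed first wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            match = re.compile(pattern).match(text, pos)
            # Strictly greater: an equally long later rule never replaces an earlier one.
            if match and match.end() - pos > best_len:
                best_name, best_len = name, match.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


for label, rules in (("original", ORIGINAL_RULES), ("fixed", FIXED_RULES)):
    for source in ("produkt([F1],[F2],EURO_CENT)", "produkt([F1],[F2],EURO)"):
        # The second-to-last token is the unit argument.
        print(label, source, "->", tokenize(source, rules)[-2])
```

With the original rule order, the last argument of example 2 comes back as a PLACE token, which is what the parser complains about. With the changed PLACE rule it comes back as UNIT, while [F1] and [F2] are still recognised as placeholders.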

Other possibilities include redefining PLACE as

PLACE : LBRACK TEXT+ NUMBER? RBRACK; // adjust other rules accordingly

adding a predicate to the rule (followsLBRACK() stands for a helper method you would have to define yourself in the lexer):

PLACE : {followsLBRACK()}? TEXT+ NUMBER ;

and redefining UNIT:

UNIT: TEXT ( 'S' | ( '_' TEXT )+ ) ; // EUROS or EURO_CENT; similar for other units.
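
The predicate option may be the least familiar of these, so here is a small Python sketch of the idea: a rule is only allowed to compete when its predicate holds. It is an illustration under assumptions, not ANTLR code. follows_lbrack is a hypothetical stand-in for followsLBRACK() that simply checks whether the previously emitted token was LBRACK, NUMBER is assumed to be [0-9]+ and is kept optional here to show the effect of the gate, and the names PRODUKT_LPAREN, COMMA and RPAREN are invented for the literals.

```python
import re


def follows_lbrack(previous_token):
    # Hypothetical stand-in for followsLBRACK(): true right after a '['.
    return previous_token == "LBRACK"


def always(previous_token):
    return True


# (name, regex, predicate) in grammar order; PLACE is gated by the predicate.
RULES = [
    ("PRODUKT_LPAREN", r"produkt\(", always),
    ("COMMA", r",", always),
    ("RPAREN", r"\)", always),
    ("LBRACK", r"\[", always),
    ("RBRACK", r"\]", always),
    ("PLACE", r"[a-zA-Z]+(?:[0-9]+)?", follows_lbrack),
    ("UNIT", r"[a-zA-Z]+(?:_[a-zA-Z]+)*", always),
]


def tokenize_with_predicates(text):
    """Longest match among the rules whose predicate holds; ties go to the first rule."""
    tokens, pos, previous = [], 0, None
    while pos < len(text):
        best = None  # (rule name, end position)
        for name, pattern, allowed in RULES:
            if not allowed(previous):
                continue  # rule is switched off in this context
            match = re.compile(pattern).match(text, pos)
            if match and (best is None or match.end() > best[1]):
                best = (name, match.end())
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        name, end = best
        tokens.append((name, text[pos:end]))
        pos, previous = end, name
    return tokens


print(tokenize_with_predicates("produkt([F],[F2],EURO)"))
```

In this sketch a bare word such as F is a PLACE only directly after a bracket, while EURO after a comma can only become UNIT, so the ambiguity never arises.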

BTW, ANTLR generally evaluates its grammars top-down, so interleaving parser and lexer rules as you have actually obfuscates the logic.
