Matching parentheses in ANTLR

Question

I'm new to Antlr and I met one issue with parentheses matching recently. Each node in the parse tree has the form (Node1,W1,W2,Node2), where Node1 and Node2 are two nodes and W1 and W2 are two weights between them. Given an input file as (1,C,10,2).((2,P,2,3).(3,S,3,2))*.(2,T,2,4), the parse tree looks wrong, where the operator is not the parent of those nodes and the parentheses are not matched.

The parse file I wrote is like this:

grammar Semi;

prog
    : expr+
    ;
    
expr
    : expr '*'   
    | expr ('.'|'+') expr 
    | tuple   
    | '(' expr ')' 
    ;

tuple
    : LP NODE W1 W2 NODE RP
    ;

LP  : '(' ;
RP  : ')' ;
W1  : [PCST0];
W2  : [0-9]+;
NODE: [0-9]+;


WS  : [ \t\r\n]+ -> skip ;    // toss out whitespace
COMMA: ',' -> skip;

It seems like expr| '(' expr ')' doesn't work correctly. So what should I do to make this parser detects if parentheses belong to the node or not? Update: There are two errors in the command:

line 1:1 no viable alternative at input '(1'
line 1:13 no viable alternative at input '(2'

So it seems like the lexer didn't detect the tuples, but why is that?

I suspect your issue has nothing to do with the '(' expr ')' production, but with your lexical rules. If you look at the tokens generated for your input, you'll likely find that they aren't what you expect. — sepp2k
@sepp2k yes I just realized that but I don't know how to fix it..I also found an error "no viable alternative at input '(1' " in the command — BrandNewStory
First of all, you shouldn't have multiple lexer rules that match the exact same pattern. That just leads to one of them being dead. Then another problem is that a single 0 by itself will produce a W1 token, but presumably you want to allow 0s as W2 or nodes as well. The easiest solution for that is probably to have dedicated lexer rule just for '0', which can be used anywhere that a W1, W2 or node can be used. — sepp2k
@sepp2k omg thank you so much! I just tried to use only one lexer for numbers and now it works perfectly. I just have one more question regarding your reply, if I use a comma as delimiters, can that single 0 issue be avoided? It seems so in the output parse tree, but I'm not sure. — BrandNewStory
The issue with 0 is that an input such as (0, 0, 0, 0), which should be valid according to my understanding of your format, would be lexed as LP W1 W1 W1 W1 RP, which doesn't match your tuple rule. Now you could change tuple to allow W1 everywhere, but then (P, P, P, P) would also be allowed, which I believe you don't want. So I'm suggesting to change it so that 0 12 P gets lexed as ZERO NUMBER W1_LETTER. Then you could define w2 and node (as parser rules, not lexer rules) as ZERO | NUMBER and you could define w1 as ZERO | W1_LETTER. — sepp2k

Mike Cargal Mike Cargal · Accepted Answer · 2021-03-11T14:50:08

Your W2 and NODE rules are the same, so nodes you intend to be NODE are matching W2.

grun with -tokens option: (notice, no NODE tokens)

[@0,0:0='(',<'('>,1:0]
[@1,1:1='1',<W2>,1:1]
[@2,3:3='C',<W1>,1:3]
[@3,5:6='10',<W2>,1:5]
[@4,8:8='2',<W2>,1:8]
[@5,9:9=')',<')'>,1:9]
[@6,10:10='.',<'.'>,1:10]
[@7,11:11='(',<'('>,1:11]
[@8,12:12='(',<'('>,1:12]
[@9,13:13='2',<W2>,1:13]
[@10,15:15='P',<W1>,1:15]
[@11,17:17='2',<W2>,1:17]
[@12,19:19='3',<W2>,1:19]
[@13,20:20=')',<')'>,1:20]
[@14,21:21='.',<'.'>,1:21]
[@15,22:22='(',<'('>,1:22]
[@16,23:23='3',<W2>,1:23]
[@17,25:25='S',<W1>,1:25]
[@18,27:27='3',<W2>,1:27]
[@19,29:29='2',<W2>,1:29]
[@20,30:30=')',<')'>,1:30]
[@21,31:31=')',<')'>,1:31]
[@22,32:32='*',<'*'>,1:32]
[@23,33:33='.',<'.'>,1:33]
[@24,34:34='(',<'('>,1:34]
[@25,35:35='2',<W2>,1:35]
[@26,37:37='T',<W1>,1:37]
[@27,39:39='2',<W2>,1:39]
[@28,41:41='4',<W2>,1:41]
[@29,42:42=')',<')'>,1:42]
[@30,43:42='<EOF>',<EOF>,1:43]

If I replace the NODEs in your parse rule with W2s (sorry, I have no idea what this is supposed to represent), I get:

It appears that your misconception is that the recursive descent parsing starts with the parser rule and when it encounters a Lexer rule, attempts to match it.

This is not how ANTLR works. With ANTLR, your input is first run through the Lexer (aka Tokenizer) to produce a stream of tokens. This step knows absolutely nothing about your parser rules. (That's why it's so often useful to use grun to dump the stream of tokens, this gives you a picture of what your parser rules are acting upon (and you can see, in your example that there are no NODE tokens, because they all matched W2).

Also, a suggestion... It would appear that commas are an essential part of correct input (unless (1C102).((2P23).(3S32))*.(2T24) is considered valid input. On that assumption, I removed the -> skip and added them to your parser rule (that's why you see them in the parse tree). The resulting grammar I used was:

grammar Semi;

prog: expr+;

expr: expr '*' | expr ('.' | '+') expr | tuple | LP expr RP;

tuple: LP W2 COMMA W1 COMMA W2 COMMA W2 RP;

LP: '(';
RP: ')';
W1: [PCST0];
W2: [0-9]+;
NODE: [0-9]+;

WS: [ \t\r\n]+ -> skip; // toss out whitespace
COMMA: ',';

To take a bit more liberty with your grammar, I'd suggest that your Lexer rules should be raw type focused. And, that you can use labels to make the various elements or your tuple more easily accessible in your code. Here's an example:

grammar Semi;

prog: expr+;

expr: expr '*' | expr ('.' | '+') expr | tuple | LP expr RP;

tuple: LP nodef=INT COMMA w1=PCST0 COMMA w2=INT COMMA nodet=INT RP;

LP: '(';
RP: ')';
PCST0: [PCST0];
INT: [0-9]+;

COMMA: ',';
WS: [ \t\r\n]+ -> skip; // toss out whitespace

With this change, your tuple Context class will have accessors for w1, w1, and node. node will be an array of NBR tokens as I've defined it here.

Matching parentheses in ANTLR

1 Answers