ANTLR, Trouble with expression grammar

Question

I've recently started using ANTLR. I'm currently trying to encode an expression grammar with +, -, * and array[index] and a few more constructs.

This is the desired grammar:

Exp -> Exp (+ | - | * | < | &&) Exp
     | Exp [ Exp ]
     | -Exp
     | ( Exp )
     | Exp.length
     | true
     | false
     | Id
     | this
     | ! Exp

I first refactored this into AndExp, SumExp, ProdExp and so on to resolve precedence. Roughly like this:

Exp        -> AndExp
AndExp     -> CmpExp (&& CmpExp)*
CmpExp     -> SumExp (< SumExp)*
SumExp     -> ProdExp ((+|-) ProdExp)*
ProdExp    -> UnaryExp (Times UnaryExp)*
UnaryExp   -> Minus* PrimaryExp
PrimaryExp -> Exp.length
            | Exp [ Exp ]
            | ! Exp
            | true
            | false
            | Id
            | this

I then realized that this uses left-recursion, and that ANTLR doesn't like that. I went on to eliminate the left-recursion and I ended up with this grammar:

grammar test;

options {
    language=Java;
    output=AST;
    backtrack=true;
}

start      : expression;

expression : andExp;
andExp     : cmpExp (And^ cmpExp)*;
cmpExp     : sumExp (LessThan^ sumExp)*;
sumExp     : prodExp ((Plus | Minus)^ prodExp)*;
prodExp    : unaryExp (Times^ unaryExp)*;
unaryExp   : Minus* primaryExp;

primaryExp : INT                   primaryPrime
           | 'true'                primaryPrime
           | 'false'               primaryPrime
           | 'this'                primaryPrime
           | ID                    primaryPrime
           | '!' expression        primaryPrime
           | '('! expression ')'!  primaryPrime
           ;


primaryPrime
           : '[' expression ']'             primaryPrime
           | '.' ID '(' exprList ')'        primaryPrime
           | '.length'                      primaryPrime
           | 'new' 'int' '[' expression ']' primaryPrime
           | 'new' ID '(' ')'               primaryPrime
           |
           ;


exprList   :
           | expression (',' expression)*;

LessThan   : '<';
Plus       : '+';
Minus      : '-';
Times      : '*';
And        : '&&';
Not        : '!';
INT        :    '0' | ('1'..'9')('0'..'9')*;
ID         :    ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
WS         : ('\t' | ' ' | '\r' | '\n'| '\u000C') { $channel=HIDDEN; } ;

Q1: Is backtracking "required" for this type of grammar (I can't get it through ANTLR unless I activate it) or am I missing something simple?
Q2: When adding a few ^ and -> ^(TOKEN ...) constructs to brush up the tree, I ran into the following annoying situation (because of the primaryPrime which is due to the left factoring):
```
primaryPrime
    : '[' expression ']' primaryPrime -> ^(ARRAY_EXPR expression)
//...
```
This turns an array[index] into
```
array
ARRAY_EXPR
    index
```
while I really want
```
ARRAY_EXPR
    array
    index
```
What is the best way to solve this? Am I on the right track here, or should I go with some other approach all together?

Bart Kiers Bart Kiers · Accepted Answer · 2011-03-11T14:31:11

A1

No, backtracking is not (yet) required. But if you do need some backtracking, it's advisable to not use backtrack=true right away but use predicate before the rules that do need backtracking. By using backtrack=true, you're enabling backtracking on all of your rules, while it's probably only one or two needing backtracking. But, if your language will be relatively small, backtrack=true is easier than mixing in predicates by hand, and probably won't have a big impact on performance. But if you can avoid them, do so.

You have a couple of parser rules that match empty strings, which are causing the problems. You'd usually better let rules match something, and make the rule optional. So instead of:

foo : bar baz ;
bar : 'bar' ;
baz : 'baz' | /* epsilon */ ;

do

foo : bar baz? ;
bar : 'bar' ;
baz : 'baz' ;

instead.

Also, in case of reserved keywords like true, false etc., don't mix them in your parser rules: always explicitly define them at the top of your lexer rules. Lexer rules are matched starting from top to bottom, so it safest to define them (at least) before rules like ID that could possible match them as well. I usually put them as first lexer rules.

A2

You could do that by passing parameters to your parser rules, although that makes your grammar (a bit) less readable.

Your grammar with my comments:

grammar test;

options {
  output=AST;
}

tokens {
  ARRAY_EXPR;
}

start      : expression;

expression : andExp;
andExp     : cmpExp (And^ cmpExp)*;
cmpExp     : sumExp (LessThan^ sumExp)*;
sumExp     : prodExp ((Plus | Minus)^ prodExp)*;
prodExp    : unaryExp (Times^ unaryExp)*;
unaryExp   :  '-' primaryExp
           |  '!' primaryExp // negation is really a `unaryExp`
           |  primaryExp
           ;

primaryExp : INT                  primaryPrime[null]?
           | 'true'               primaryPrime[null]?
           | 'false'              primaryPrime[null]?
           | 'this'               primaryPrime[null]?
           | (ID -> ID)           (primaryPrime[new CommonTree($ID)] -> primaryPrime)?
           | '('! expression ')'! primaryPrime[null]?
           ;

// removed the matching of 'epsilon'
primaryPrime [CommonTree parent]
           : '[' expression ']'             primaryPrime[null]? -> ^(ARRAY_EXPR {parent} expression primaryPrime?)
           | '.' ID '(' exprList? ')'       primaryPrime[null]?
           | '.length'                      primaryPrime[null]?
           | 'new' 'int' '[' expression ']' primaryPrime[null]?
           | 'new' ID '(' ')'               primaryPrime[null]?
           ;

// removed the matching of 'epsilon' 
exprList   : expression (',' expression)*;

// be sure to create explicit tokens for keywords!
True       : 'true';
False      : 'false';
This       : 'this';
LessThan   : '<';
Plus       : '+';
Minus      : '-';
Times      : '*';
And        : '&&';
Not        : '!';
INT        : '0' | ('1'..'9')('0'..'9')*;
ID         : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
WS         : ('\t' | ' ' | '\r' | '\n'| '\u000C') { $channel=HIDDEN; } ;

will parse the input "array[2*3]" into the following AST:

enter image description here

as you can see by running the following test class:

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
  public static void main(String[] args) throws Exception {
    String source = "array[2*3]";
    testLexer lexer = new testLexer(new ANTLRStringStream(source));
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    testParser parser = new testParser(tokens);
    testParser.start_return returnValue = parser.start();
    CommonTree tree = (CommonTree)returnValue.getTree();
    DOTTreeGenerator gen = new DOTTreeGenerator();
    StringTemplate st = gen.toDOT(tree);
    System.out.println(st);
  }
}

ANTLR, Trouble with expression grammar

1 Answers

A1

A2