2
votes

I have an antlr rule that processes AND and OR expressions. It looks like this:

expr : expr 'AND' expr 
     | expr 'OR' expr 
     | 'a' | 'b' | 'c' | 'd'; 

This results in a parse tree that's very deep. E.g. if you have

a AND b AND c AND d

results in a tree like this:

              expr
           /    |  \
        expr   AND  d
     /    |  \
   expr AND  c
 / | \
a AND b

This can get pretty deep and expensive to evaluate so I wanted to add an optimization. I want to process multiple sequential AND expressions together (similarly for OR-s).

So I want to do something like this:

expr : expr ('AND' expr)+
     | expr ('OR' expr)+
     | 'a' | 'b' | 'c' | 'd'; 

And I thought that would generate one node for all AND-s in a sequence.

However, when I do that antlr still chooses to generate the recursive tree. I suppose that's because the rule is ambiguous. Any ideas on how to get it to be flatter? Is it a matter of ordering of rules or something like that? The reason I care for the depth is because of performance effects due to deep recursion.

2

2 Answers

2
votes

You can easily do it, if you have several rules like in older grammars (C grammar).

expr:   orExpr
    ;

orExpr: andExpr ('OR' andExpr)*
        ;

andExpr : primExpr ('AND' primExpr)*
        ;

primExpr:'a' | 'b' | 'c' | 'd'; 

WS : ' ' -> skip;

sample text:

a AND b AND c AND d

result:

resulting parse tree

2
votes

Is the recursion creating an actual, measured performance problem. If so, can you quantify the extent of recursion you are dealing with. Antlr is usually quite good at handling recursion, so if you are experiencing a real performance problem, it may be due to a deeper issue in Antlr. Ter and Sam will need to reproduce it in order to deal with it.

That said, every instance of expr on the rule rhs will create a recursion. Grouping and limiting instances

expr : expr ('AND' | 'OR') expr 
     | 'a' | 'b' | 'c' | 'd'
     ; 

won't change the number of recursions necessary to satisfy the rule -- that depends on the data being processed.

What might be a greater gain is, if your performance problem actually stems from a lot of deep recursion rule failures, is to restructure your grammar to fail the rule(s) sooner or limit the number of rules (or subrules) potentially applicable for any given sequence of data.

Exactly how is difficult to say based on the information provided so far.

Updated

To clarify, a recursive rule call in Antlr is not implemented differently from any other rule call; it is not implemented by a recursive Java-method call.

The LL(*) algorithm of Antlr is implicitly a sequential path-solver operating against a precomputed valid-state network. Checkpoint information is preserved at each decision point, which includes rule calls, for backtracking. The checkpoint data necessary to capture a rule call state transition is relatively simple and not sensitive to whether ; the target node represents a grammar-based recursive call or not. Performance largely correlates to the number of rule node evaluations, including in particular all those of any rule subpaths tried and eventually failed.

That is why unrolling a recursive rule into several rules is unlikely to improve performance. If they that collectively implement the same recursive function, then at best, execution against the same input will require the same number of rule calls. At worst, the non-minimal expression of rules will require more rule calls and, likely, will incur more internal check points (every * and + parenthetical is implicitly guarded).

Assuming that your grammar is optimally minimized, and that the order of the 20+ expression rules is statistically optimized for your source data, you could could optimize further by only walking the exact subtree that will actually match the next sequence of source data.

Just statistically predict the correct subtree rule in advance. Given the number of times you are evaluating the parse tree, you should be able to fairly quickly correlate some simply computed signature of source data sequences with the subtrees eventually matched. Or at least decide a more appropriate ordering of subtrees to try.

Exactly what the signature function should be, or if the cost-benefit will be positive, is difficult to say based on the information provided so far. Really depends on the nature of your source data.