If we have a language consisting only of atomic elements and unary and binary operators:

atomic elements: a b c
unary operators: ! ~ + -
binary operators: + - / *

Then we can define a grammar:

ATOM := a | b | c
UNOP := ! | ~ | + | -
BINOP := + | - | / | *
EXPR := ATOM | UNOP EXPR | EXPR BINOP EXPR

However this grammar leads to an ambiguous parse tree (and an infinite loop in a recursive descent parser due to left recursion).

So we add a precedence table:

Precedence 1: unary+ unary- ~ ! (Right to Left)
Precedence 2: * / (Left to Right)
Precedence 3: binary+ binary- (Left to Right)

My question is: by what algorithm or procedure can we take the precedence table and produce an appropriate (non-left-recursive) grammar for a recursive descent parser?

A precedence table is an ordered list of operator groups and associated directions (L->R or R<-L). The answer would be something that takes this as input and produces grammars as output.

2 Answers

3 votes

The general grammar that describes arbitrary precedence can be parsed using LALR parsers, which are table-based and can be generated with yacc. But not all is lost if you wish to use a recursive descent parser.

A recursive descent parser for such a grammar only verifies that the syntax is correct. Building a syntax tree is a different matter, and precedence should be handled at the tree-building level.

So consider the following left-recursion-free grammar, which can parse infix expressions. Nothing special, no sign of precedence:

Expr := Term (InfixOp Term)*
InfixOp := '+' | '-' | '*' | '/'
Term := '(' Expr ')'
Term := identifier

(The notation on the right-hand side is regex-like; nonterminals are written in CamelCase, while tokens are quoted or written in lowercase.)
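For concreteness, here is a minimal sketch of a recursive descent recognizer for this grammar (assuming a pre-tokenized input list of single-character tokens; it checks syntax only and builds no tree):

```python
def parse_expr(tokens, i=0):
    """Expr := Term (InfixOp Term)* ; returns the index past the expression."""
    i = parse_term(tokens, i)
    while i < len(tokens) and tokens[i] in ('+', '-', '*', '/'):
        i = parse_term(tokens, i + 1)   # consume InfixOp, then another Term
    return i

def parse_term(tokens, i):
    """Term := '(' Expr ')' | identifier"""
    if i < len(tokens) and tokens[i] == '(':
        i = parse_expr(tokens, i + 1)
        if i >= len(tokens) or tokens[i] != ')':
            raise SyntaxError("expected ')'")
        return i + 1
    if i < len(tokens) and tokens[i].isalnum():
        return i + 1
    raise SyntaxError("unexpected token at position %d" % i)

def accepts(tokens):
    return parse_expr(tokens) == len(tokens)
```

Note that `accepts(list('A+B*C'))` and `accepts(list('A*B+C'))` are both true; the recognizer is completely indifferent to precedence, which is exactly why the tree-building step below has to take care of it.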

When building the syntax tree, you have a current node to which you add new nodes.

Most often, when you parse a rule, you create a new child node on the current node and make that child the current node. When you finish parsing the rule, you step back up to the parent node.

Now this is what should be done differently when you parse the InfixOp rule. You should assign a precedence strength to the relevant nodes. The Expr node has the weakest precedence, while all operator nodes have stronger ones.

Handling infix precedence

When parsing the InfixOp rule do the following:

  1. While the current node's precedence is stronger than the incoming node's, keep going up one level (make the parent node the current node).

  2. Then insert the incoming node as the new parent of the current node's last child, and make the incoming node the current node.

Since the Expr node is declared to have the weakest precedence, it ultimately stops the climbing.
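The two steps can be sketched in a few lines (a sketch, not the author's exact code; the Node class and the numeric strengths are assumed, with higher meaning tighter binding):

```python
class Node:
    def __init__(self, label, strength=0):
        self.label, self.strength = label, strength
        self.parent, self.children = None, []

    def add_child(self, child):
        child.parent = self
        self.children.append(child)

def insert_infix(current, incoming):
    # Step 1: climb while the current node binds tighter than the incoming one.
    while current.strength > incoming.strength:
        current = current.parent
    # Step 2: re-parent the current node's last child under the incoming node.
    last = current.children.pop()
    current.add_child(incoming)
    incoming.add_child(last)
    return incoming     # the incoming operator becomes the current node

# Building the tree for A+B*C (strengths: Expr 0, + 1, * 2):
expr = Node('Expr', 0)
cur = expr
cur.add_child(Node('A'))
cur = insert_infix(cur, Node('+', 1))
cur.add_child(Node('B'))
cur = insert_infix(cur, Node('*', 2))
cur.add_child(Node('C'))
```

Because Expr has strength 0, the while loop can never climb past the root.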

Let's see an example: A+B*C

Below, the current node is always marked with ! after the current token has been consumed.

Parsed: none

Expr!

Parsed: A

Expr!
|
A

Parsed: A+

Expr
|
+!
|
A

Parsed: A+B

  Expr
  |
  +!
 / \
A   B

Parsed: A+B*

  Expr
  |
  +
 / \
A   *!
   /
  B

Parsed: A+B*C

  Expr
  |
  +
 / \
A   *!
   / \
  B   C

If you traverse this tree in postorder, you get the reverse Polish notation of the expression, which can be used to evaluate it.
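For example, representing the finished A+B*C tree as nested tuples (a sketch; the tuple encoding is just for illustration), a postorder walk reads off the RPN directly:

```python
def to_rpn(node):
    if isinstance(node, str):       # leaf: an atom
        return [node]
    op, *kids = node
    out = []
    for kid in kids:                # children first...
        out.extend(to_rpn(kid))
    out.append(op)                  # ...operator after its operands
    return out

tree = ('+', 'A', ('*', 'B', 'C'))  # the tree built above for A+B*C
print(to_rpn(tree))                 # ['A', 'B', 'C', '*', '+']
```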

Now the reverse example: A*B+C

Parsed: none

Expr!

Parsed: A

Expr!
|
A

Parsed: A*

Expr
|
*!
|
A

Parsed: A*B

  Expr
  |
  *!
 / \
A   B

Parsed: A*B+

  Expr
  |
  +!
  |
  *
 / \
A   B

Parsed: A*B+C

    Expr
    |
    +!
   / \
  *   C
 / \
A   B

Handling associativity

There are operators that are left associative while others are right associative. For example in the C language family the + is left associative while the = is right associative.

Actually, the whole associativity question boils down to how operators on the same precedence level are handled during climbing. For left-associative operators, keep going up when you encounter an operator on the same precedence level. For right-associative operators, stop when you encounter one on the same level.

(It takes too much space to demonstrate all techniques, I recommend trying it out on a piece of paper.)
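The whole difference fits in one comparison; here is a sketch where each operator carries a strength and a right-associativity flag (the names are assumed, not from the original answer):

```python
from collections import namedtuple

Op = namedtuple('Op', 'strength right_assoc')

def should_climb(current, incoming):
    # Strictly stronger always climbs; on equal strength, climb only
    # when the operator is left-associative.
    if current.strength != incoming.strength:
        return current.strength > incoming.strength
    return not incoming.right_assoc

plus   = Op(strength=2, right_assoc=False)  # left-assoc:  A+B+C => (A+B)+C
assign = Op(strength=1, right_assoc=True)   # right-assoc: a=b=c => a=(b=c)
print(should_climb(plus, plus))     # True  -> new + climbs over old +
print(should_climb(assign, assign)) # False -> new = nests under old =
```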

Handling prefix and postfix operators

In this case you need to modify the grammar a bit:

Expr := PrefixOp* Term PostfixOp* (InfixOp PrefixOp* Term PostfixOp*)*
InfixOp := '+' | '-' | '*' | '/'
Term := '(' Expr ')'
Term := identifier

When you encounter a prefix operator, just add it as a new child of the current node and make that child current, regardless of precedence. This is correct whether the operator is strong or weak; the precedence-climbing rules of the infix operators ensure correctness.

For postfix operators you can use the same precedence climbing described for infix operators; the only difference is that a postfix operator has no right-hand side, so its node has only one child.
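A sketch of the prefix rule (the minimal Node class is assumed, with labels and parent/child links only):

```python
class Node:
    def __init__(self, label):
        self.label, self.parent, self.children = label, None, []

def insert_prefix(current, op_label):
    # No precedence check here: the prefix operator always nests under
    # the current node; later infix climbing keeps the tree correct.
    op = Node(op_label)
    op.parent = current
    current.children.append(op)
    return op                       # the prefix operator becomes current

# Parsing the prefix part of "-!A": each prefix operator nests under the
# previous one, which naturally gives right-to-left grouping: -(!A).
cur = root = Node('Expr')
cur = insert_prefix(cur, '-')
cur = insert_prefix(cur, '!')
cur.children.append(Node('A'))
```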

Handling ternary operators

The C language family has the ?: ternary operator. With regard to syntax tree building, you can handle the ? and : as separate infix operators, but there is a trick. The node you create for the ? should be an incomplete ternary node: you do the usual precedence climbing and place it, but this incomplete node has the lowest precedence, which prevents even weaker operators, like the comma operator, from climbing over it. When you reach the :, climb up to the first incomplete ternary node (if you don't find one, report a syntax error), change it to a complete node with its normal precedence, and make it current. If you reach the end of the expression while there are still incomplete ternary nodes on the current branch, again report a syntax error.

So the a , b ? c : d is interpreted as a , (b ? c : d).

But the a ? c , d : e will be interpreted as a ? (c , d) : e, since we prevented the climbing of the comma over the ?.

Handling array indexes and function calls

Despite their postfix appearance, array indexes and function calls are infix operators with a syntactically enforced parenthesized term on the right.

2 votes

It's easy enough to convert an operator precedence grammar into an LR(1) grammar [1], but the resulting grammar will use left recursion to parse left-associative operators. It's also easy enough to eliminate the left recursion -- for example, by making all operators right-associative -- but while the resulting grammar recognizes the same language, the parse trees are different.

It turns out that it's not hard to slightly modify a recursive descent parser to handle precedence relations. The technique was invented by Vaughan Pratt, and essentially uses the call stack in place of the explicit stack in the classic shunting-yard algorithm.
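A minimal Pratt-style sketch (not Pratt's original formulation; binding powers are chosen to match the precedence table in the question, with unary operators binding tightest):

```python
BP = {'+': 10, '-': 10, '*': 20, '/': 20}   # infix binding powers

def parse(tokens, min_bp=0):
    """tokens is a list of single-character tokens, consumed left to right."""
    tok = tokens.pop(0)
    if tok == '(':
        left = parse(tokens, 0)
        tokens.pop(0)                        # discard the closing ')'
    elif tok in ('-', '+', '!', '~'):        # prefix operators bind tightest
        left = ('unary' + tok, parse(tokens, 30))
    else:
        left = tok                           # an atom
    while tokens and tokens[0] in BP and BP[tokens[0]] > min_bp:
        op = tokens.pop(0)
        # Left associativity: recurse with the operator's own binding
        # power, so an equal-strength operator stops the recursion.
        right = parse(tokens, BP[op])
        left = (op, left, right)
    return left

print(parse(list('A+B*C')))   # ('+', 'A', ('*', 'B', 'C'))
```

The `min_bp` parameter plays exactly the role of the precedence climbing in the first answer: the call stack remembers where weaker operators must attach.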

Pratt parsing seems to be undergoing some sort of revival, and you can find lots of blog posts about it; one reasonably good one is by Eli Bendersky. Pratt devised the procedure in the early 1970s, about the same time Frank DeRemer was proving that LR(1) parsing was practical. Pratt was skeptical about both the practicality and the flexibility of formal parsing, and I think the debate has pretty well been simmering ever since. Pratt parsers are indeed simple and flexible, but it can be very difficult to prove that they are correct (or that they parse a particular formally described grammar). On the other hand, although bison has recently acquired support for GLR parsing, making it potentially a lot less fidgety to use, and despite the fact that bison-generated parsers actually parse the grammar they claim to parse, there are still many who would agree with Pratt's statement (from 1973) that formal parsing methods are "less accessible and less pleasant to use".


[1] In practice, all yacc-derivatives and many other LR parser generators will accept precedence relations for disambiguating; the resulting grammar tables are smaller and involve fewer unit reductions, so there is no particularly good reason not to use this technique if you're going to use a parser generator.