The general grammar that describes arbitrary precedence can be parsed using LALR parsers which are table based and can be generated using yacc. But this doesn't mean all is lost when you wish to use recursive descent parsers.
The recursive descent parser can only verify whether the syntax is correct. Building a syntax tree is a different matter and precedence should be handled on the tree building level.
So consider the following grammar without left recursion which can parse infix expressions. Nothing special no sign of precedence:
Expr := Term (InfixOp Term)*
InfixOp := '+' | '-' | '*' | '/'
Term := '(' Expr ')'
Term := identifier
(The notation used on the right side is regex like, the rules that have substitution written using large camel case, tokens are quoted or written using small camel case).
When building the syntax tree you have a current node which you add new nodes to.
Most often when you parse a rule you create a new child node on the current node and make that child the current node. When finished with the parsing you step up to the parent node.
Now this is what should be done differently when you parse the InfixOp
rule. You should assign precedence strength to the relevant nodes. The Expr
node have the weakest precedence, while all other operators have stronger ones.
Handling infix precedence
When parsing the InfixOp
rule do the following:
While the current node's precedence stronger than the incoming node's precedence, keep going up one level (make the parent node the current).
Then insert the node for the incoming one as a parent of the last child of the current node and make it current.
Since the Expr
node declared to have the weakest precedence it will ultimately stop the climbing.
Let's see an example: A+B*C
There the current node always marked with !
after consuming the current token.
Parsed: none
Expr!
Parsed: A
Expr!
|
A
Parsed: A+
Expr
|
+!
|
A
Parsed: A+B
Expr
|
+!
/ \
A B
Parsed: A+B*
Expr
|
+
/ \
A *!
/
B
Parsed: A+B*C
Expr
|
+
/ \
A *!
/ \
B C
If you traverse this in postorder way, you will get reverse polish notation for the expression which can be used to evaluate it.
Or the reverse an example: A*B+C
Parsed: none
Expr!
Parsed: A
Expr!
|
A
Parsed: A*
Expr
|
*!
|
A
Parsed: A*B
Expr
|
*!
/ \
A B
Parsed: A*B+
Expr
|
+!
|
*
/ \
A B
Parsed: A*B+C
Expr
|
+!
/ \
* C
/ \
A B
Handling associativity
There are operators that are left associative while others are right associative. For example in the C language family the +
is left associative while the =
is right associative.
Actually the whole associativity thing is boils down to the handling of operators on the same precedence level. For left associative operators when climbing keep going up when you encounter an operator on the same precedence level. For right associative operators, stop when you encounter the same operator.
(It takes too much space to demonstrate all techniques, I recommend trying it out on a piece of paper.)
Handling prefix and postfix operators
In this case you need to modify the grammar a bit:
Expr := PrefixOp* Term PostfixOp* (InfixOp PrefixOp* Term PostfixOp*)*
InfixOp := '+' | '-' | '*' | '/'
Term := '(' Expr ')'
Term := identifier
When you encounter a prefix operator, just add it as a new child to the current node and make the new child as current, regardless of precedence, it will be correct even if it's a strong operator or a weak one, the precedence climbing rules of the infix operators ensure the correctness.
For postfix operators you can use the same precedence climbing I described at infix operators, the only difference that we don't have a right side for postfix operators, so it will have only 1 child.
Handling ternary operators
The C language family has the ?:
ternary operator. With regard to the syntax tree building you can handle the ?
and :
as separate infix operators. But there is a trick. The node you create for the ?
should be an incomplete ternary node, which means you do the usual precedence climbing and place it, but this incomplete node will have lowest precedence, this prevents even weaker operators like comma operator climb over it. When you reach the :
you must climb up till the first incomplete ternary node (if you don't find one, then report syntax error), then change it to a complete node which will have normal precedence, and make it current. If you reach the end of the expression unexpectedly when there are incomplete ternary nodes on the current branch, again report a syntax error.
So the a , b ? c : d
is interpreted as a , (b ? c : d)
.
But the a ? c , d : e
will be interpreted as a ? (c , d) : e
, since we prevented the climbing of the comma over the ?.
Handling array indexes and function calls
Despite the postfix appearance they are infix operators with syntactically enforced parenthesized term on the right, this is true for array indexes and function calls as well.