0
votes

So I've been trying to parse a haskell-like language grammar with bison. I'll omit the standard problems with grammars and unary minus (like, what is (-5) from -5 and \x->x-5 or if a-b is a-(b) or apply a (-b) which itself can still be apply a \x->x-b, haha.) and go straight to the thing that suprised me.

To simplify the whole thing to the point where it matters, consider following situation:

expression: '(' expression ')'
    | expression expression            /* lambda application */
    | '\\' IDENTIFIER "->" expression  /* lambda abstraction */
    | expression '+' expression        /* some operators to play with */
    | expression '*' expression
    | IDENTIFIER | CONSTANT            /* | ..... */
    ;

I solved all shift/reduce conflicts with '+' and '*' with %left and %right precedence macros, but I somehow failed to find any good solution how to set precedence for the expression expression lambda application. I tried %precedence and %left and %prec marker as shown for example here %http://www.gnu.org/software/bison/manual/html_node/Non-Operators.html#Non-Operators, but it looks like bison is completely ignoring any precedence setting on this rule. At least all combinations I tried failed. Documentation on exactly this topic is pretty sparse, whole thing looks like suited only for handling the "classic" expr. OPER expr. case.

Question: Am I doing something wrong, or is this impossible in Bison? If not, is it just unsupported or is there some theoretical justification why not?

Remark: Of course there's an easy workaround to force left-folding and precedence that would look schematically like

expression: expression_without_lambda_application
    | expression expression_without_lambda_application
    ;

expression_without_lambda_application: /* ..operators.. */
    | '(' expression ')'
    ;

...but that's not as neat as it could be, right? :]

Thanks!

1

1 Answers

3
votes

It's easiest to understand how bison precedence works if you understand how LR parsing works, since it's based on a simple modification of the LR algorithm. (Here, I'm just combining SLR, LALR and LR grammars, because the basic algorithm is the same.)

An LR(1) machine has two possible classes of action:

  1. Reduce the right-hand side of the production which ends just before the lookahead token (and consequently is at the top of the stack).
  2. Shift the lookahead token.

In an LR(1) grammar, the decision can always be made on the basis of the machine state and the lookahead token. But certain common constructs -- notably infix expressions -- apparently require grammars which appear more complicated than they need to be, and which require more unit reductions than should be necessary.

In an era in which LR parsing was new, and most practitioners were used to some sort of operator precedence grammar (see below for definition), and in which cycles were a lot more expensive than they are now so that the extra unit reductions seemed annoying, the modification of the LR algorithm to use standard precedence techniques was attractive.

The modification -- which is based on a classic algorithm for parsing operator precedence grammars -- involves assigning a precedence value (an integer) to every right-hand side (i.e. every production) and to every terminal. Then, when constructing the LR machine, if a given state and lookahead can trigger either a shift or a reduce action, the conflict is resolved by comparing the precedence of the possible reduction with the precedence of the lookahead token. If the reduction has a higher precedence, it wins; otherwise the machine shifts.

Note that reduction precedences are never compared with each other, and neither are token precedences. They can actually come from different domains. Furthermore, for a simple expression grammar, intuitively the comparison is with the operator "at the top of the stack"; this is actually accomplished by using the right-most terminal in a production to assign the precedence of the production. To handle left vs. right associativity, we don't actually use the same precedence value for a production as for a terminal. Left-associative productions are given a precedence slightly higher than the terminal's precedence, and right-associative productions are given a precedence slightly lower. This could be done by making the terminal precedences multiples of 3 and the reduction precedences a value one greater or less than the terminal. (Actually in practice the comparison is > rather than so it's possible to use even numbers for terminals, but that's an implementation detail.)

As it turns out, languages are not always quite so simple. So sometimes -- the case of unary operators is a classic example -- it's useful to explicitly provide a reduction precedence which is different from the default. (Another case is where the precedence is more related to the first terminal than the last, in the case where there are more than one.)

Editorial note: Really, this is all a hack. It's a good hack, and it can be useful. But like all hacks, it can be pushed too far. Intricate tricks with precedence which require a full understanding of the algorithm and a detailed analysis of the grammar are not, IMHO, elegant. They are confusing. The whole point of using a context-free-grammar formalism and a parser generator is to simplify the presentation of the grammar and make it easier to verify. /Editorial note.

An operator precedence grammar is an operator grammar which can be bottom-up parsed using only precedence relations (using an algorithm such as the classic "shunting-yard" algorithm). An operator grammar is a grammar in which no right-hand side has two consecutive non-terminals. And the production:

expression: expression expression

cannot be expressed in an operator grammar.

In that production, the shift reduce conflict comes in the middle, just before where the operator would be if there were an operator. In that case, one would want to compare the precedence of whichever reduction gave rise to the first expression with the invisible operator which separates the expressions.

In some circumstances (and this requires careful grammar analysis, and is consequently very fragile), it's possible to distinguish between terminals which could start an expression and terminals which could be operators. In that case, it would be possible to use the precedence of the terminals in the FIRST set of expression as the comparators in the precedence comparison. Since those terminals will never be used as the comparators in an operator production, no additional ambiguity is created.

Of course, that fails as soon as it is possible for a terminal to be either an infix or a prefix operator, such as unary minus. So it's probably only of theoretical interest in most languages.

In summary, I personally think that the solution of explicitly defining non-application expressions is clear, elegant and consistent with the theory of LR parsing, while any attempt to use precedence relations will turn out to be far less easy to understand and verify.

But, if you insist, here is a grammar which will work in this particular case (without unary operators), based on assigning precedence values to the tokens which might start an expression:

%token IDENTIFIER CONSTANT  APPLY
%left '(' ')' '\\' IDENTIFIER CONSTANT APPLY
%left '+'
%left '*'

%%
expression: '(' expression ')' 
          | expression expression %prec APPLY
          | '\\' IDENTIFIER "->" expression
          | expression '+' expression
          | expression '*' expression
          | IDENTIFIER | CONSTANT
          ;