Midrule actions (MRA) force the parser to make early parsing decisions. In this case, for example, the parser needs to execute a MRA before the while
in a do ... while
, but when it sees the while
it is too early to know whether that while
terminates the do
commands or starts a while
command.
Without the MRA there is no problem (probably, depending on the rest of your grammar) because it can keep shifting tokens until it sees either do
or enddo
.
MRAs should be avoided unless absolutely necessary. [Note 1] In most cases where MRAs seem tempting, it turns out that you are trying to do too much inside of the parser. It is often better to limit the parser to producing an Abstract Syntax Tree (AST), or to produce three-address code (TAC) segments in basic blocks, inside of a control-flow graph structure rather than as a monolithic instruction array. [Note 2] These intermediate data structures make basic algorithms (such as filling in branch targets) simpler, and are the basis for a variety of more complicated and extremely useful algorithms which produce faster and smaller code. (Common subexpression elimination, dead code elimination, constant-folding, etc., etc.)
But even if you have decided to go with an approach which seems to benefit from MRAs, you will find that it is often better to avoid them by moving the actions into the non-terminal they follow, or to use an explicit marker non-terminal (that is, an empty non-terminal whose only purpose is to execute an action). These strategies often produce more readable grammars, and in many cases -- including this one -- the reorganization solves reduce-reduce conflicts.
Bison effectively turns MRAs into markers -- you can see that in the grammar report produced with the -v
option -- but true markers have the advantage that they can be used more than once. By contrast, every MRA is distinct (in the bison implementation), even if the actions are character-by-character identical. For example, in the simplified grammar in your question, bison generates nine different marker non-terminals, all with the same action: {std::cout<<"D";}
. As a result, bison ends up complaining of a reduce-reduce conflict because it cannot decide between two different markers which both do the same thing. Obviously, there is no underlying conflict in this case, and replacing the action with an explicit marker would completely avoid the issue without requiring major surgery.
For example, here is a very simplified grammar which (directly) produces three-address code. Note the use of the new-label
marker which inserts a label (and has the label number as its semantic value):
%{
#include <stdarg.h>
#include <stdio.h>
void yyerror(const char*);
int yylex(void);
int pc = 0; /* Program counter */
int label = 0; /* Current label */
int temp = 0; /* Current temporary variable */
void emit_label(int n) { printf("%10s_L%d:\n", "", n); }
void emit_stmt(const char* fmt, ...) {
va_list ap;
va_start(ap, fmt);
printf("/* %3d */\t", pc++);
vprintf(fmt, ap);
putchar('\n');
va_end(ap);
}
%}
%token T_DO "do" T_ENDDO "enddo" T_ENDWHILE "endwhile" T_WHILE "while"
%token ID NUMBER
%%
program
: statements
/* Inserts a label.
* The semantic value is the number of the label.
*/
new-label
: %empty { $$ = label++; emit_label($$); }
/* Parses a series of statements as a block, preceded by a label
* The semantic value is the label preceding the block.
*/
statements
: new-label
| statements statement
statement
: while-statement
| do-statement
| assign-statement
assign-statement
: ID '=' expr { emit_stmt("%c = _t%d", $1, $3); }
while-statement
: new-label "while" condition-jump-if-false "do" statements "endwhile"
{ emit_stmt("JUMP _L%d", $1, 0); emit_label($3); }
do-statement
: "do" statements new-label "while" condition-jump-if-false "enddo"
{ emit_stmt("JUMP _L%d", $2, 0); emit_label($5); }
/* Semantic value is the label which will be branched to if the condition
* evaluates to false.
*/
condition-jump-if-false
: compare { $$ = label++; emit_stmt("IFZ _t%d, _L%d", $1, $$); }
compare
: expr '<' expr { $$ = temp++; emit_stmt("_t%d = _t%d < _t%d", $$, $1, $3); }
expr: term
| expr '+' term { $$ = temp++; emit_stmt("_t%d = _t%d + _t%d", $$, $1, $3); }
term: factor
| term '*' factor { $$ = temp++; emit_stmt("_t%d = _t%d * _t%d", $$, $1, $3); }
factor
: ID { $$ = temp++; emit_stmt("_t%d = %c", $$, $1); }
| NUMBER { $$ = temp++; emit_stmt("_t%d = %d", $$, $1); }
| '(' expr ')' { $$ = $2; }
That code creates more labels than it really needs. The direct-output architecture forces these labels to be printed, but what is really important is that the position in the generated code is saved as the semantic value of a non-terminal (possibly) representing the head of a basic block. Doing that consistently means that final actions have access to the information they need.
Worth noting is the fact that the marker new-label
is used before both instances of while
. In only one case is the label it creates actually needed, but it is not possible to know which production will eventually succeed.
The above code is not entirely satisfactory, for a variety of reasons. For a start, since it insists on writing out every line immediately, it is impossible to insert a placeholder for a jump statement. Consequently, the marker which inserts conditional jumps always jumps forwards (that is, it compiles a jump to a label which has not yet been defined) with the result that the end-test do while construct ends up with code like (source: ... do ... while a < 3 enddo
)
_L4:
/* ... Loop body omitted */
/* 23 */ _t16 = a
/* 24 */ _t17 = 3
/* 25 */ _t18 = _t16 < _t17
/* 26 */ IFZ _t18, _L6
/* 27 */ JUMP _L4
_L6:
instead of the slightly more efficient
_L4:
/* ... Loop body omitted */
/* 23 */ _t16 = a
/* 24 */ _t17 = 3
/* 25 */ _t18 = _t16 < _t17
/* 26 */ IFNZ _t18, _L4
That could be fixed by storing the TAC in an array rather than printing it out, and then backpatching labels into branches. (That change doesn't really affect the grammar, though, because it is all done in final actions.) But it would be harder to implement the classic pre-test optimization which turns:
_L1:
/* 2 */ _t1 = a
/* 3 */ _t2 = 0
/* 4 */ _t3 = _t1 < _t2
/* 5 */ IFZ _t3, _L2
/* ... Loop body omitted */
/* 14 */ JUMP _L1
into
_L1:
/* 2 */ JUMP _L2
/* ... Loop body omitted */
_L2:
/* 12 */ _t1 = a
/* 13 */ _t2 = 0
/* 14 */ _t3 = _t1 < _t2
/* 15 */ IFNZ _t3, _L1
(Reordering basic blocks can often save branches; in general, it's easier to output basic blocks in optimal order than to construct them in textual order and then move them around.)
Notes
MRAs certainly should not be used to attempt to trace the parser because (as in this case) they essentially change the nature of the parse. If you want to trace your parser, follow the steps in the bison manual's section on tracing parses (and read the rest of the chapter on debugging parsers.)
The style of producing TAC via print statements goes back to those early days of computing when memory was so expensive and so limited that compiling had to be done in multiple stages, each one writing its result out to "external storage" (eg. paper tape) so that it could be sequentially read for the next pass. The vast majority of actual programmers were not yet born when this style of writing compilers stopped being necessary, but a surprising number of teaching resources still start, consciously or not, with this basic architecture. The browser you are using to read this answer does not hesitate to use two gigabytes of (virtual memory) to display it. In that context, it's seems silly to worry about using a couple of hundred kilobytes of temporary storage to hold an AST while compiling a program on the same computer.