1
votes

I'm reading a "Compiler Construction, Principles and Practice" book by Kenneth Louden and trying to understand error recovery in Yacc.

The author is giving an example using the following grammar:

%{
#include <stdio.h>
#include <ctype.h>

int yylex();
int yyerror();
%}

%%

command : exp { printf("%d\n", $1); }
        ; /* allows printing of the result */

exp : exp '+' term { $$ = $1 + $3; }
    | exp '-' term { $$ = $1 - $3; }
    | term { $$ = $1; }
    ;

term : term '*' factor { $$ = $1 * $3; }
     | factor { $$ = $1; }
     ;

factor : NUMBER { $$ = $1; }
       | '(' exp ')' { $$ = $2; }
       ;

%%

int main() {
  return yyparse();
}

int yylex() {
  int c;

  /* eliminate blanks*/
  while((c = getchar()) == ' ');

  if (isdigit(c)) {
    ungetc(c, stdin);
    scanf("%d\n", &yylval);
    return (NUMBER);
  }

  /* makes the parse stop */
  if (c == '\n') return 0;

  return (c);
}

int yyerror(char * s) {
  fprintf(stderr, "%s\n", s);
  return 0;
} /* allows for printing of an error message */

Which produces the following state table (referred to as table 5.11 later on)

enter image description here

Numbers in the reductions correspond to the following productions:

(1) command : exp.
(2) exp : exp + term.
(3) exp : exp - term.
(4) exp : term.
(5) term : term * factor.
(6) term : factor.
(7) factor : NUMBER.
(8) factor : ( exp ).

Then Dr. Louden gives the following example:

Consider what would hapen if an error production were added to the yacc definition

factor : NUMBER {$$ = $1;}
       | '(' exp ')' {$$=$2;}
       | error {$$ = 0;}
       ;

Consider first erroneous input 2++3 as in the previous example (We continue to use Table 5.11, although the additional error production results in a slightly different table.) As before the parser will reach the following point:

parsing stack         input
$0 exp 2 + 7          +3$

Now the error production for factor will provide that error is a legal lookahead in state 7 and error will be immediately shifted onto the stack and reduced to factor, causing the value 0 to be returned. Now the parser has reached the following point:

parsing stack                 input
$0 exp 2 + 7 factor 4         +3$

This is a normal situation, and the parser will continue to execute normally to the end. The effect is to interpret the input as 2+0+3 - the 0 between the two + symbols is there because that is where the error pseudotoken is inserted, and by the action for the error production, error is viewed as equivalent to a factor with value 0.

My question is very simple:

How did he know by looking at the grammar that in order to recover from this specific error (2++3) he needs to add an error pseudotoken to the factor production? Is there a simple way to do it? Or the only way to do it is to work out multiple examples with the state table and recognize that this particular error will occur on this given state and therefore and if I add an error pseudotoken to a some specific production the error will be fixed.

Any help is appreciated.

1

1 Answers

2
votes

In that simple grammar, you have very few options for an error production, and all of them will allow the parse to continue.

Choosing the one at the bottom of the derivation tree makes some sense in this case, but that's not a general purpose heuristic. It's more commonly useful to put error productions at the top of the derivation tree where they can be used to resynchronize the parse. For example, suppose we'd modified the grammar to allow for multiple expressions, each on its own line: (which would require modifying yylex so that it doesn't fake an EOF when it sees \n):

program: %empty
       | program '\n'
       | program exp '\n'  { printf("%g\n", $1); }

Now, if we want to just ignore errors and continue parsing, we can add a resynchronizing error production:

       | program error '\n'

The '\n' terminal in the above will cause tokens to be skipped until a newline can be shifted to reduce the error production, so that the parse can continue with the next line.

Not all languages are so easy to resynchronize, though. Statements in C-like languages are not necessarily terminated by ;, and a naive attempt to resynchronize as above would cause a certain amount of confusion if the error were, for example, a missing }. However, it would allow the parse to continue in some way, and that might be sufficient.

In my experience, getting error productions right usually requires a lot of trial and error; it is much more of an art than a science. Trying a lot of erroneous inputs and analysing the error recovery will help.

The point of an error production is to recover from an error. Producing good error messages is an unrelated but equally challenging problem. By the time the parser attempts error recovery, the error message has already been sent to yyerror. (Of course, that function could ignore the error message and leave it to the error production to print an error, but there's no obvious reason to do that.)

One possible strategy for producing good error messages is to do some kind of table lookup (or computation) on the parser stack and the lookahead token. In effect, that's what bison's builtin expanded error handling does, and that often produces pretty reasonable results, so it's a good starting place. Alternative strategies have been explored. One good reference is Clinton Jeffrey's 2003 paper Generating LR Syntax Error Messages from Examples; you might also check out Russ Cox's explanation of how he applied that idea to a Go compiler.