2
votes

I'm using some parser and lexer generating tools (similar to Lex and Bison, but for C#) to generate programs that parse strings into abstract syntax trees that can later be evaluated.

I wanted to do error recovery (i.e. report in the produced abstract sentence tree that there are missing tokens and such). I had two approaches in mind to structuring the generated grammars, and I was wondering which approach was better/more flexible/wouldn't have conflicts (the .y and .lex files are generated based on a description of the calculator).

The calculator description allows the user to specify terminals/regex's and their placement for operators and the associativity. So something like:

grammar.AddTerminal("Plus", "\\+").
    AddNonTerminal(new NonTerminal("Add", Associativity.LeftToRight).
        AddTerminal("Expression").
        AddTerminal("Plus").
        AddTerminal("Expression"));

(Precedence is specified via the order that the Terminal's and NonTerminal's are added. "Add" is the name of a method that is discovered via Reflection. Basically it tells the NonTerminal what to call the operator in the abstract syntax tree.)


Approach 1: (allow the empty rule for any expression)

S -> E
E -> E + T
E -> T
T -> T * P
T -> P
P -> (E)
P -> (E [error]
P -> a
P -> @ [error]

a is a terminal. @ is empty.


Approach 2: (only allow the empty rule for the start rule)

S -> E
S -> @ [error]
E -> + [error]
E -> T + [error]
E -> + T [error]
E -> E + T
E -> T
T -> * [error]
T -> * P [error]
T -> P * [error]
T -> T * P
T -> P
P -> (E)
P -> (E [error]
P -> a

Here's an example to show a left-most derivation for a bad-input using each approach.


Input: (a +


Approach 1:

S
E
T
P
(E
(E + T
(T + T
(P + T
(a + T
(a + P
(a +

Approach 2:

S
E
T
P
(E
(T +
(P +
(a +

Approach 2 is much harder to code for (consider subtraction/unary negative operator. You can't just look at subtract A -> A - B, take out that first A and report an error on A -> - B because that's valid for the unary operator.) I coded for approach 2 this morning only to find out that I think it has grammar issues and that an empty rule as in Approach 1 makes things much simpler, code-wise, but my main concern is which approach would produce the least amount of grammar issues as programmers create calculator descriptions as described above.

1
What does this have to with yacc? yacc has a very specific error recovery scheme based on throwing away tokens to create something parseable. It does not have an easy way of dealing with missing tokens.Chris Dodd
Well the parsing tool I'm using is MPPG, which is a Microsoft branding of GPPG, which is a spin-off of Yacc. Since the syntax is very similar for the .y files, I tagged it as such.Jesus is Lord

1 Answers

3
votes

This question has been around for a while and covers a topic that is is often visited by beginners in the subject. One often find that those who have done a compilers course in their undergraduate degree know that this is one of those questions that has no easy or single answer. You might have noticed that you have two questions on the same topic, neither of which has been answered. Another question someone else posted was answered with pointers to the literature that explains why this is hard.

It is a question that has remained extant for over 50 years. If one examines the literature over time, from early conference papers, course textbooks, doctoral thesis and (today) SO, we can see regular references to the fact that this is the wrong question! (Or rather the wrong approach to the problem).

Just taking a sample from course texts over the years (random selections from my bookshelf):

Gries, D. (1970) Error Recovery and Correction - An introduction to the Literature, In Compiler Construction, An advanced Course edited by Bauer, F.L. & Eickel, J., Springer Verlag, pp.627-638.
Gries, D. (1971) Compiler Construction for Digital Computers, Wiley, pp.320-326.
Aho, A.V., Ullman, J.D. (1977) Principles of Compiler Design, Addison Wesley, pp.397-405.
Bornat, R. (1979) Understanding and Writing Compilers, Macmillan, pp.251-252.
Hanson, D. (1995) A retargetable C compiler: Design and Implementation,Addison-Wesley, pp.140-146.
Grune, D., Bal, H.E., Jacobs, C.J.H. & Langendoen, K.G. (2000) Modern Compiler Design, Wiley, pp.175-184.
Aho, A.V., Lam, M.S., Sethi, R., Ullman, J.D. (2007) Compilers: Principles, Techniques, & Tools, Pearson, Addison-Wesley, pp.283-296.

All of these concur (over a 40 year span) that your question is about using the wrong tools wrongly or going in the wrong direction. I think I'm trying to say "You can't there from here". You should start from somewhere else.

If you want something deeper, there is a whole Phd thesis:

Charles, P. (1991) A Practical method for Constructing Efficient LALR(k) Parsers with Automatic Error Recovery, New York University

Hopefully, for those who visit this question again in the future, there is a placeholder for the answers.


Managed Babel Systems Essentials
Garden Points Parser Generator
Irony .NET compiler Construction Kit
Writing your first Visual Studio Language Service