I'm using some parser and lexer generating tools (similar to Lex and Bison, but for C#) to generate programs that parse strings into abstract syntax trees that can later be evaluated.
I wanted to do error recovery (i.e. report in the produced abstract sentence tree that there are missing tokens and such). I had two approaches in mind to structuring the generated grammars, and I was wondering which approach was better/more flexible/wouldn't have conflicts (the .y and .lex files are generated based on a description of the calculator).
The calculator description allows the user to specify terminals/regex's and their placement for operators and the associativity. So something like:
grammar.AddTerminal("Plus", "\\+").
AddNonTerminal(new NonTerminal("Add", Associativity.LeftToRight).
AddTerminal("Expression").
AddTerminal("Plus").
AddTerminal("Expression"));
(Precedence is specified via the order that the Terminal
's and NonTerminal
's are added. "Add"
is the name of a method that is discovered via Reflection. Basically it tells the NonTerminal what to call the operator in the abstract syntax tree.)
Approach 1: (allow the empty rule for any expression)
S -> E
E -> E + T
E -> T
T -> T * P
T -> P
P -> (E)
P -> (E [error]
P -> a
P -> @ [error]
a
is a terminal.
@
is empty.
Approach 2: (only allow the empty rule for the start rule)
S -> E
S -> @ [error]
E -> + [error]
E -> T + [error]
E -> + T [error]
E -> E + T
E -> T
T -> * [error]
T -> * P [error]
T -> P * [error]
T -> T * P
T -> P
P -> (E)
P -> (E [error]
P -> a
Here's an example to show a left-most derivation for a bad-input using each approach.
Input: (a +
Approach 1:
S
E
T
P
(E
(E + T
(T + T
(P + T
(a + T
(a + P
(a +
Approach 2:
S
E
T
P
(E
(T +
(P +
(a +
Approach 2 is much harder to code for (consider subtraction/unary negative operator. You can't just look at subtract A -> A - B
, take out that first A
and report an error on A -> - B
because that's valid for the unary operator.) I coded for approach 2 this morning only to find out that I think it has grammar issues and that an empty rule as in Approach 1 makes things much simpler, code-wise, but my main concern is which approach would produce the least amount of grammar issues as programmers create calculator descriptions as described above.