4
votes

I am learning about recursive decent parsing and come up with some scenarios in which I think the algorithm isn't working. One of them is, considering this simple grammar:

S → E;
E → id | id + id

Then the string id + id; is valid in the language of this grammar. However if we execute the recursive descent algorithm, it descends from S to E then to id, which is the first matching terminal. Now the input is at the + and we are back at S trying to match ; which then fails; but there is no other rule to choose from at the level of S.

I think the grammar is not ambiguous as there are only 2 strings in the language id; and id + id;, each of which has a unique parse tree. The general problem here is that a non-terminal has productions with same prefixes and potentially makes a choice that matches at a deeper level in the recursion but creates invalid inputs for shallower levels.

I've read about typical problems with recursive descent like left-recursion but haven't found the problem above mentioned anywhere. So is this really a problem or am I missing something?


I found an authoritative answer from the book Parsing Techniques: A Practical Guide p.182-188 which classify the above approach as naive recursive decent and highlights the same problem. There are two solutions that always work for general case with no look-ahead (since in general the required look-ahead length increases with the prefix length): Exhaustive Recursive Descent which requires the use of continuations and Breadth-First Recursive Descent.

3

3 Answers

1
votes

I'm so rusty on this that I may be about to post rubbish, but isn't this solvable with lookahead? Something like:

func recogniseS
    expect(E)
    expect(semicolon)

fund recogniseE
    expect(id)

    if nextTokenIs(plus) then 
        expect(plus)
        expect(id)
    endif

Or, similarly, you could reformulate as:

S → id [+ id];

i.e. the essence is simply that the + is optional. So the situation can be dealt with as long as anything optional can be dealt with.

1
votes

It's not a problem in so far as one can factor the grammar like so (where the first alternative for E' is empty):

S → E ;
E → id E'
E' → | + id

For E', we predict the first alternative if the next token is ; and the second if the next token is +.

1
votes

It's a problem in the sense that if you write a PEG grammar like that, it will not work. That is a known problem, occasionally described as a problem with PEG parsing, but I don't find it fair to blame PEG for people writing grammars that it can't handle - other parsing formalisms aren't immune that that either.

If it's not a PEG grammar but a plain old CFG, there should be no problem unless the tool you're using is silly or bugged. It should be able to convert this to a working parser, regardless of whether it uses recursive descend or an other algorithm. If it uses recursive descend, it will probably use lookahead, and that will get rid of this case.