3
votes

I am writing a recursive descent parser for config files. These are mostly similar to ini files. Here is the language in some sort of EBNF-like form:

document     ::= { category }
category     ::= title {entry}
title        ::= "[" <name> "]"
entry        ::= <key> ":" <value>

Here is an example of a file that should give a parsing error at the end:

[Category1]
key1:val1
key2 :val2
key3 : val3

[Category2]
key4: val4

this line right here should produce an error

All of the examples I could find online would parse the input until an invalid symbol is reached, then quit without printing a useful error message. I have a working parser that follows this behavior, but I am not sure how to implement useful error reporting.

For example, a document consists of zero or more categories. What do I do when the first two categories are parsed without error but the third contains a syntax error? What if the input ends after the second category and I am unable to parse a third category because there are no tokens left (this should not produce an error message)? How do I differentiate between these situations? The invalid line could be made valid in two ways: becoming an entry or becoming a title. This confuses me.

I would like my program to print something like line 9: expected entry or title when it reaches the last line of the above input. How do people normally implement error messages in recursive descent parsers?

1
I usually do something like:"Parse error line 42: found '%', expected '&'."rossum
making an invalid string valid is a very complex task. First make simple error reporting.Alexei Kaigorodov

1 Answers

4
votes

I think you're asking a couple of big and unrelated questions, I'll try to answer them:

What do I do when the first two categories are parsed without error but the third contains a syntax error?

This of course depends on your requirements. In a strict sense, this would be a validation fail because the document passed in does not actually confirm to your grammar rules, due to the violation that you mentioned. Whether it throws an Error, returns false, returns partial results, all up to your requirements.

What if the input ends after the second category and I am unable to parse a third category because there are no tokens left (this should not produce an error message)? How do I differentiate between these situations?

There is no problem here as you say since it complies with the document symbol. If you're confused about the actual implementation, then that's an implementation detail (for which you didn't provide any of your code).

Perhaps you should try to visualize your EBNF in its more basic BNF form, without the {} extension, here's one example:

document     ::= category
category     ::= title entries category | title entries
title        ::= "[" <name> "]"
entries      ::= entry | entries
entry        ::= <key> ":" <value>

I personally think representing your grammar this way leads to more guidance for where you need the recursion. For example in this case, you will want to attempt to parse category within the category symbol parsing. The structure of your code would follow that more or less - ie, if it can't parse the next symbol as a category, then return true anyways (since the 2nd definition title entries follows)

How do people normally implement error messages in recursive descent parsers?

I have the same question myself, but seeing that I couldn't find any answers on SO, I will be implementing it the following way:

  1. As I parse and eat tokens, I will store the length consumed.
  2. When an error is reached, don't immediately throw it - store it alongside with the amount of tokens that have been parsed.
  3. At the end of the parse, including backtracking, if there are no parseable solutions, throw the error that consumed the most symbols

The key is #2. While implementing a recursive descent parser with backtracking, multiple false positives can be thrown. I think in terms of usability, its a generally decent heuristic to simply throw the one that parsed the furthest.