12
votes

Sorry in advance; I am sure this question will seem almost idiotic to people who are used to playing with parsers and grammars, but these are foreign topics to me, and this is my attempt at stepping gently into a practical case requiring them.

I would like to write a parser for the following "language", which contains a single "special structure" looking like this:

\command[ options ]{ contents }

The contents can be anything, including nested commands, and may contain escaped brackets or backslashes \{ \} \\. I realise that "anything" is not specific, but ideally they should be determined by matching brackets (excluding escaped ones), if possible.

The options should be a comma-separated list of assignment expressions such as name = value, but the value may be a quoted string containing = or , characters. Finally the previous name and command should validate the regular expression \w[\w\d\._-+*]* -- that is, the first character should be a letter, and the remaining characters should be a letter, digit or one of . _ - + *.

Writing this with regular expressions seems overly complicated (for instance, because values may contain quoted characters , =, which would otherwise separate assignments or name/value pairs). So I think the most appropriate tool here is a grammar, but despite superficial readings, I am just not sure how to write it (BNF, PEG, etc?), which type of parsers to use (LR, recursive decent, etc?), and how I could use the parsing output in a practical program.

I would prefer answers with Python, which explains the tag, but of course I would be perfectly happy with a combination of tools if necessary/better suited.


NOTE: this is NOT about LaTeX. I realise the similarity of course, but LaTeX is extremely more complex than the previous language, for instance with character codes varying depending on the context. I am merely asking for a practical example which (I think) is simple enough for SO, and yet would already be useful to me in my daily work.

1
Is this (La)TeX?Willem Van Onsem
No :) LaTeX is much more complicated, with character codes, @-statements, etc. In a way this is a very very strong restriction of LaTeX. I am asking mainly because I want to learn on a case which can already be useful at work, and which (I think) is simple enough for an answer on SO.Jonathan H
So refreshing to read a question with parsing in the title that's actually about parsing.Jared Smith
"The contents can be anything" doesn't give much to go on from the point of view of writing a parser.John Coleman
@JohnColeman What I mean is that the contents should be determined by matching bracket pairs, but ignoring escaped brackets. Does it need to be more specific for a grammar?Jonathan H

1 Answers

7
votes

Start by expressing your grammar more formally, in whatever notation you prefer. e.g., from your description, an EBNF would be like this:

program := element+
element := command | literal
literal := (not '\')+

command := '\'identifier options? '{' program '}'
options := option | options ',' option
option  := identifier '=' value
value   := number | string

string  := '"' (escape | not '\' or '"')* '"'
escape  : = '\' char

Then either feed this to a parser generator (pyParsing, pyYACC, ANTLR) or write a parser by hand. In the latter case top-down is the simplest option: start from the top of the grammar and convert each rule to a function which will either return a parsed AST node and consume the input or return nothing or throw. Example:

 def program():
    elements = []
    while next_sym():
        elements.append(element())
    return {'type': 'program', 'children': elements}

 def element():
     return command() or literal()

 def command():
     if next_sym() == '\\':
         get_sym()
         ...parse command here
         return {'type': 'command', 'children': ...}
     return None 

where next_sym returns the next symbol from the input (or None on EOF) and get_sym consumes the symbol and advances the input buffer.