Generating lex matching rules and yacc grammar rules from an XML DTD

Question

Overview

Although this question involves lex and yacc, which generate C code, it's fundamentally about programming in Python.

I have several very similar DTDs that I'm using to parse a document. That section of the program is written in C, and there's just no need to invoke a full SAX handler (viz., libxml2) for this purpose. Since the DTDs (and therefore the XML files) have a static format, I think that this problem can best be solved with lex and yacc.

While writing a full lexer and parser for arbitrary XML documents is far too complex, writing one for a specific subset of XML documents is entirely manageable. The DTD could be used to generate both the lex rules for the lexical analyzer (which tokenizes the input) and the yacc grammar for the parser.
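To make the idea concrete, here is a minimal Python sketch of emitting lex rules from a list of element names. The helper names `token_name` and `lex_rules`, and the token names `*_OPEN`, `*_CLOSE`, and `TEXT`, are all hypothetical. It assumes tags carry no attributes and ignores whitespace handling, entities, comments, and CDATA sections, so it only illustrates the shape of the output.

```python
import re

def token_name(element, kind):
    # Turn an XML element name into a C-friendly token identifier
    # (hypothetical naming scheme: TITLE_OPEN, TITLE_CLOSE, ...).
    return re.sub(r"\W", "_", element).upper() + "_" + kind

def lex_rules(element_names):
    # Emit the rules section of a lex file: one literal pattern per
    # start tag and end tag, plus a catch-all for character data.
    lines = ["%%"]
    for name in element_names:
        lines.append('"<%s>"\t{ return %s; }' % (name, token_name(name, "OPEN")))
        lines.append('"</%s>"\t{ return %s; }' % (name, token_name(name, "CLOSE")))
    lines.append("[^<]+\t{ return TEXT; }")
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    print(lex_rules(["doc", "title", "para"]))
```

The matching `%token` declarations for the yacc file could be generated from the same list of names, which would keep the two files in sync.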

There are two assumptions I am willing to make:

The XML document is well-formed vis-à-vis REC-xml-19980210
The XML document is valid vis-à-vis its DTD

Therefore, if an XML document fails to satisfy any of the above, the lexical analyzer/parser should simply fail for that particular file.

Questions

My ultimate goal is to write a Python script that successfully: (1) parses the DTD; and (2) generates the lex/yacc files. Before I begin, I have several questions:

Has this problem already been solved?
- If so, are there any libraries that I should consider looking at?
- If not, is it because there is no solution using the tools I've mentioned?
Are there better (as measured by performance) ways to extract the non-markup 'content' from XML files than using a static parser?

I realize that I can use PLY to parse the DTD, but because I'm interested in generating the lex/yacc files for inclusion in a C program, that option will not work. As such, I'm thinking that I might use xml.parsers.expat to parse the DTD. This allows me to register callbacks that track the element names, their position within the tree, whether they're required, etc. This should offer me enough information to generate lex/yacc files, but I would like to see what advice you guys have.
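As a rough sketch of the callback approach, the following uses `ElementDeclHandler` from `xml.parsers.expat` to collect element declarations. The helper names `collect_element_decls` and `render` are hypothetical. It assumes the DTD text can be wrapped in a synthetic internal subset of a dummy document; a DTD that uses parameter entities inside declarations may not parse this way. The `render` function only prints the content model in DTD-like notation. Turning `?`, `*`, and `+` into yacc rules would still need auxiliary productions.

```python
from xml.parsers import expat
from xml.parsers.expat import model as m

# Map expat quantifier codes to the usual DTD suffixes.
QUANT = {
    m.XML_CQUANT_NONE: "",
    m.XML_CQUANT_OPT: "?",
    m.XML_CQUANT_REP: "*",
    m.XML_CQUANT_PLUS: "+",
}

def render(content_model):
    # expat reports a content model as (type, quantifier, name, children).
    ctype, quant, name, children = content_model
    suffix = QUANT[quant]
    if ctype == m.XML_CTYPE_NAME:
        return name + suffix
    if ctype == m.XML_CTYPE_EMPTY:
        return "EMPTY"
    if ctype == m.XML_CTYPE_ANY:
        return "ANY"
    if ctype == m.XML_CTYPE_MIXED:
        parts = ["#PCDATA"] + [render(c) for c in children]
        return "(" + " | ".join(parts) + ")" + suffix
    # Remaining cases are choice groups and sequence groups.
    sep = " | " if ctype == m.XML_CTYPE_CHOICE else " , "
    return "(" + sep.join(render(c) for c in children) + ")" + suffix

def collect_element_decls(dtd_text, root="root"):
    # Assumption: wrap the DTD in an internal subset so expat will read it.
    decls = {}

    def on_element_decl(name, content_model):
        decls[name] = content_model

    parser = expat.ParserCreate()
    parser.ElementDeclHandler = on_element_decl
    doc = "<!DOCTYPE %s [\n%s\n]>\n<%s/>" % (root, dtd_text, root)
    parser.Parse(doc, True)
    return decls

if __name__ == "__main__":
    dtd = """
    <!ELEMENT doc (title, para+)>
    <!ELEMENT title (#PCDATA)>
    <!ELEMENT para (#PCDATA)>
    """
    for name, cm in collect_element_decls(dtd).items():
        print(name, "->", render(cm))
```

Attribute declarations could be gathered the same way through `AttlistDeclHandler`, which expat calls with the element name, attribute name, type, default, and a required flag.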

Paul Sweatte · Accepted Answer · 2012-07-26T20:22:45

Use a combination of the XML Lexer, the yacc grammar, and the YAXX extension to generate the respective files.
