Overview
Although this question involves lex/yacc, which generate C code, it's fundamentally about programming in Python.
I have several very similar DTDs that I'm using to parse a document. That section of the program is written in C, and there's just no need to invoke a full SAX handler (viz., libxml2) for this purpose. Since the DTDs (and therefore the XML files) have a static format, I think that this problem can best be solved with lex and yacc.
While writing a full parser for arbitrary XML documents is far too complex, writing one for a specific subset of XML documents is entirely manageable. The DTD could be used to generate both the lex specification (for the lexical analyzer, which tokenizes the input) and the yacc grammar (for the parser).
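To make the idea concrete, here is a minimal sketch of how one element's content model might be turned into yacc rules. It is only an illustration under several assumptions: the function name `yacc_rules`, the `NAME_OPEN`/`NAME_CLOSE` token names, and the `opt_`/`star_`/`plus_` helper prefixes are all made up for this example, element names are assumed to be valid yacc identifiers, and only flat sequences of children are handled (no nested groups or choices). Since yacc has no `?`, `*` or `+` operators, each quantified child gets a helper rule.

```python
def yacc_rules(element, children):
    """Emit yacc rules for one element with a flat sequence content model.

    children is a list of (child_name, quantifier) pairs, where the
    quantifier is '', '?', '*' or '+', as written in the DTD.
    Token names such as NOTE_OPEN are hypothetical; the lexer would
    have to define them.
    """
    prefixes = {"?": "opt_", "*": "star_", "+": "plus_"}
    helpers = []
    rhs = ["%s_OPEN" % element.upper()]
    for child, quant in children:
        if quant == "":
            rhs.append(child)
            continue
        helper = prefixes[quant] + child
        rhs.append(helper)
        if quant == "?":
            helpers.append("%s : /* empty */ | %s ;" % (helper, child))
        elif quant == "*":
            # Left recursion keeps the yacc parser stack shallow.
            helpers.append("%s : /* empty */ | %s %s ;" % (helper, helper, child))
        else:  # '+'
            helpers.append("%s : %s | %s %s ;" % (helper, child, helper, child))
    rhs.append("%s_CLOSE" % element.upper())
    return ["%s : %s ;" % (element, " ".join(rhs))] + helpers


if __name__ == "__main__":
    # Corresponds to <!ELEMENT note (to, body?)>
    for rule in yacc_rules("note", [("to", ""), ("body", "?")]):
        print(rule)
```

A real generator would also need to emit the matching lex patterns for the open and close tags and to handle `#PCDATA`, choices and nested groups.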
There are two assumptions I am willing to make:
- The XML document is well-formed vis-à-vis REC-xml-19980210
- The XML document is valid vis-à-vis its DTD
Therefore, if an XML document fails to satisfy either of the above, the lexical analyzer/parser should simply fail for that particular file.
Questions
My ultimate goal is to write a Python script that successfully: (1) parses the DTD; and (2) generates the lex/yacc files. Before I begin, I have several questions:
- Has this problem already been solved?
- If so, are there any libraries that I should consider looking at?
- If not, is it because there is no solution using the tools I've mentioned?
- Are there better (as measured by performance) ways to extract the non-markup 'content' from XML files than using a static parser?
I realize that I could use PLY to parse the DTD, but PLY produces parsers that run in Python, whereas I need lex/yacc files for inclusion in a C program, so it does not solve the generation step. As such, I'm thinking that I might use xml.parsers.expat to parse the DTD. This allows me to register callbacks that track the element names, their position within the tree, whether they're required, etc. This should offer me enough information to generate lex/yacc files, but I would like to see what advice you guys have.
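For reference, this is roughly what I have in mind with expat. It is a sketch under assumptions, not tested against my real DTDs: expat parses documents rather than bare DTD files, so the DTD text is wrapped in the internal subset of a dummy document, which assumes the DTD does not use parameter-entity references inside declarations. The function name `collect_dtd` and the sample DTD are made up for illustration; `ElementDeclHandler`, `AttlistDeclHandler` and the `xml.parsers.expat.model` constants are part of the standard library.

```python
import xml.parsers.expat
from xml.parsers.expat import model


def collect_dtd(dtd_text, root):
    """Return (elements, attributes) declared in dtd_text.

    elements maps an element name to expat's content model tuple
    (type, quantifier, name, children). attributes is a list of
    (element, attribute, type, default, required) tuples.
    """
    elements = {}
    attributes = []

    def on_element(name, content_model):
        elements[name] = content_model

    def on_attlist(elname, attname, type_, default, required):
        attributes.append((elname, attname, type_, default, bool(required)))

    parser = xml.parsers.expat.ParserCreate()
    parser.ElementDeclHandler = on_element
    parser.AttlistDeclHandler = on_attlist

    # Wrap the DTD in an internal subset; expat is non-validating, so the
    # empty root element does not have to satisfy the declarations.
    doc = "<!DOCTYPE %s [\n%s\n]>\n<%s/>" % (root, dtd_text, root)
    parser.Parse(doc, True)
    return elements, attributes


SAMPLE_DTD = """
<!ELEMENT note (to, body?)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT body (#PCDATA)>
<!ATTLIST note id CDATA #REQUIRED>
"""

if __name__ == "__main__":
    elements, attributes = collect_dtd(SAMPLE_DTD, "note")
    kind, quant, name, children = elements["note"]
    print(kind == model.XML_CTYPE_SEQ, [child[2] for child in children])
    print(attributes)
```

The quantifier of each child (`model.XML_CQUANT_OPT`, `XML_CQUANT_REP`, `XML_CQUANT_PLUS` or `XML_CQUANT_NONE`) is what tells me whether a child is required, which seems to be the information needed to write the yacc rules.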