4
votes

I would like to create simple xml parser using bison/flex. I don't need validation, comments, arguments, only <tag>value</tag>, where value can be number, string or other <tag>value</tag>.

So for example:

<div>
  <mul>
    <num>20</num>
    <add>
      <num>1</num>
      <num>5</num>
    </add>
  </mul>
  <id>test</id>
</div>

If it helps, I know the names of all tags that may occur. I know how many sub-tag can be hold by given tag. Is it possible to create bison parser that would do something like that:

- new Tag("num", 1)           // tag1
- new Tag("num", 5)           // tag2
- new Tag("add", tag1, tag2)  // tag3
- new Tag("num", 20)          // tag4
- new Tag("mul", tag4, tag3)
...
- root = top_tag

Tag & number of sub-tags:

  • num: 1 (only value)
  • str: 1 (only value)
  • add | sub | mul | div: 2 (num | str | tag, num | str | tag)

Could you help me with grammar to be able to create AST like given above?

2
Is it important that you use a subset of XML for this? If your language consists simply of arithmetic expressions, I'd advise you to look instead at parsing infix expressions, which will let you use strings like (20 * (1 + 5)) / test. Unless it's required for some other reason, XML seems a bit like overkill, especially if you're writing the parser!anton.burger
Data are kept in XML so I have no choice.user360872
Apologies for the late response; are you any closer to a solution for this? Given that you have to use XML, why not use an already-written library? Do you have a choice of programming language? Or is the whole point of the exercise to write a parser? And if so, do you have to use a parser generator? For a relatively simple grammar like this, you could easily write a recursive-descent parser.anton.burger
@shambulator, Yes, it's a goal to write xml parser without using libraries. Since I didn't get any advice how to do it, I decided to experiment and it proved to be very simple to write xml grammar in bison.user360872
This sounds too weird. Why would you want to use bison for this? it sounds like a wrong tool or a bad (or at least awkward) choice of a learning projectuser34537

2 Answers

5
votes

For your requirements, I think the yax system would work well. From the README:

The goal of the yax project is to allow the use of YACC (Gnu Bison actually) to parse/process XML documents.

The key piece of software for achieving the above goal is to provide a library that can produce an XML lexical token stream from an XML document.

This stream can be wrapped to create an instance of yylex() to feed tokens to a Bison grammar to parse and process the XML document.

Using the stream plus a Bison grammar, it is possible to carry at least the following kinds of activities.

  1. Validate XML documents,
  2. Directly parse XML documents to create internal data structures,
  3. Construct DOM trees.
2
votes

I do not think that it's the best tool to use to create a xml parser. If I have to do this job, I'll do it by hand.

Flex code will contains : NUM match integer in this example. STR match match any string which does not contains a '<' or '>'. STOP match all closing tags. START match starting tags.

<\?.*\?> { ;} 
<[a-z]+> { return START; }
</[a-z]+> { return STOP; }
[0-9]+ { return NUM; }
[^><]+ { return STR; }

Bison code will look like

%token START, STOP, STR, NUM
%%
simple_xml : START value STOP
;
value : simple_xml 
| STR
| NUM
| value simple_xml
;