Parsing a grammar with Boost Spirit

Question

I am trying to parse C-function-like tree expressions such as the following (using the Spirit Parser Framework):

F( A() , B( GREAT( SOME , NOT ) ) , C( YES ) )

For this I am trying to use the three rules on the following grammar:

template< typename Iterator , typename ExpressionAST >
struct InputGrammar : qi::grammar<Iterator, ExpressionAST(), space_type> {

    InputGrammar() : InputGrammar::base_type( ) {
       tag = ( qi::char_("a-zA-Z_")  >> *qi::char_("a-zA-Z_0-9") )[ push_back( at_c<0>(qi::_val) , qi::_1 ) ];
       command =  tag [ at_c<0>(qi::_val) = at_c<0>(qi::_1) ] >> "(" >> (*instruction >> ",")
                                        [ push_back( at_c<1>(qi::_val) , qi::_1 ) ]  >> ")";
       instruction = ( command | tag ) [qi::_val = qi::_1];
    }
    qi::rule< Iterator , ExpressionAST() , space_type > tag;
    qi::rule< Iterator , ExpressionAST() , space_type > command;
    qi::rule< Iterator , ExpressionAST() , space_type > instruction;
};

Notice that my tag rule just tries to capture the identifiers used in the expressions (the 'function' names). Also notice that the signature of the tag rule returns an ExpressionAST instead of a std::string, as in most examples. The reason I want to do it like this is actually pretty simple: I hate using variants and I will avoid them if possible. It would be great to have my cake and eat it too, I guess.

A command should start with a tag (the name of the current node, first string field of the AST node) and a variable number of arguments enclosed by parentheses, and each of the arguments can be a tag itself or another command.

However, this example does not work at all. It compiles and everything, but at run time it fails to parse all my test strings. And the thing that really annoys me is that I can't figure out how to fix it, since I can't really debug the above code, at least in the traditional meaning of the word. Basically, the only way I can see to fix the above code is by knowing what I am doing wrong.

So, the question is that I don't know what is wrong with the above code. How would you define the above grammar?

The ExpressionAST type I am using is:

struct MockExpressionNode {
    std::string name;
    std::vector< MockExpressionNode > operands;

    typedef std::vector< MockExpressionNode >::iterator iterator;
    typedef std::vector< MockExpressionNode >::const_iterator const_iterator;

    iterator begin() { return operands.begin(); }
    const_iterator begin() const { return operands.begin(); }
    iterator end() { return operands.end(); }
    const_iterator end() const { return operands.end(); }

    bool is_leaf() const {
        return ( operands.begin() == operands.end() );
    }
};

BOOST_FUSION_ADAPT_STRUCT(
    MockExpressionNode,
    (std::string, name)
    (std::vector<MockExpressionNode>, operands)
)

Something that I found out recently is that C and C++ identifiers can have '$' characters in their names. So that a-z, A-Z, 0-9 (except for first character), _ and $ are valid in a C/C++ identifier. — Cthutu
@Cthutu MSVC allows accented characters in identifiers. Doesn't mean it's standard compliant. — Etienne de Martel
More importantly, what is the point you're trying to make @Cthutu? Is there a shortage in identifiers? Does your compiler not support namespaces correctly? — sehe

academicRobot · Accepted Answer · 2010-06-20T04:20:41

As for debugging, it's possible to use a normal break-and-watch approach. This is made difficult by how you've formatted the rules, though. If you format per the Spirit examples (roughly one parser per line, one Phoenix statement per line), breakpoints will be much more informative.

Your data structure doesn't have a way to distinguish A() from SOME in that they are both leaves (let me know if I'm missing something). From your variant comment, I don't think this was your intention, so to distinguish these two cases, I added a bool commandFlag member variable to MockExpressionNode (true for A() and false for SOME), with a corresponding fusion adapter line.
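
For illustration, here is a minimal sketch of what the extended node could look like. The member name commandFlag comes from the description above; the render helper is hypothetical and is only there to show that A() and SOME become distinguishable. The Fusion adapter appears as a comment so that the snippet builds without Boost.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Extended node: commandFlag is true for "A()" and false for a bare tag
// such as "SOME".
struct MockExpressionNode {
    std::string name;
    std::vector<MockExpressionNode> operands;
    bool commandFlag = false;
};

// With Boost.Fusion, the adapter would gain a third entry, roughly:
//   BOOST_FUSION_ADAPT_STRUCT(
//       MockExpressionNode,
//       (std::string, name)
//       (std::vector<MockExpressionNode>, operands)
//       (bool, commandFlag)
//   )

// Hypothetical helper (not part of the original code): writes a node back
// out in the input syntax, so a command with no operands prints as "A()"
// while a tag prints as "SOME".
std::string render(const MockExpressionNode& node) {
    if (!node.commandFlag) {
        return node.name;
    }
    std::string out = node.name + "(";
    for (std::size_t i = 0; i < node.operands.size(); ++i) {
        if (i != 0) {
            out += ", ";
        }
        out += render(node.operands[i]);
    }
    return out + ")";
}
```

Without the flag, both nodes would have a name and an empty operands vector, so nothing in the tree could tell them apart.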

For the code specifically, you need to pass the start rule to the base constructor, i.e.:

InputGrammar() : InputGrammar::base_type(instruction) {...}

This is the entry point into the grammar, and its absence is why you were not getting any data parsed. I'm surprised it compiled without it; I thought that the grammar type was required to match the type of the first rule. Even so, this is a convenient convention to follow.

For the tag rule, there are actually two parsers: qi::char_("a-zA-Z_"), which is _1 with type char, and *qi::char_("a-zA-Z_0-9"), which is _2 with type (basically) vector<char>. It's not possible to coerce these into a string without autorules, but it can be done by attaching a semantic action to each parsed char:

tag =   qi::char_("a-zA-Z_")
        [ at_c<0>(qi::_val) = qi::_1 ]
    >> *qi::char_("a-zA-Z_0-9")           //[] has precedence over *, so _1 is
        [ at_c<0>(qi::_val) += qi::_1 ];  //  a char rather than a vector<char>

However, it's much cleaner to let Spirit do this conversion. So define a new rule:

qi::rule< Iterator , std::string(void) , ascii::space_type > identifier;
identifier %= qi::char_("a-zA-Z_") >> *qi::char_("a-zA-Z_0-9");

And don't worry about it ;). Then tag becomes

tag = identifier
      [
          ph::at_c<0>(qi::_val) = qi::_1,
          ph::at_c<2>(qi::_val) = false //commandFlag
      ];

For command, the first part is fine, but there are a couple of problems with (*instruction >> ",")[ push_back( at_c<1>(qi::_val) , qi::_1 ) ]. This parses zero or more instruction rules followed by a single ",", not a comma-separated list. It also attempts to push_back a vector<MockExpressionNode> (not sure why this compiled either; maybe it was never instantiated because of the missing start rule?). I think you want the following (with the identifier modification):

command =
        identifier
        [
           ph::at_c<0>(qi::_val) = qi::_1, 
           ph::at_c<2>(qi::_val) = true    //commandFlag
        ]
    >>  "("
    >> -(instruction % ",")
        [
           ph::at_c<1>(qi::_val) = qi::_1
        ]
    >>  ")";

This uses the optional operator - and the list operator %; the latter is equivalent to instruction >> *("," >> instruction). The Phoenix expression then just assigns the vector directly to the structure member, but you could also attach the action directly to the instruction match and use push_back.
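
To make the semantics of -(instruction % ",") concrete, the following is a hand-written recursive-descent sketch of the same shape, using only the standard library. It is not Spirit code and not how Spirit implements these operators. Node, TreeParser and its members are hypothetical names, and the sketch decides between command and tag by peeking for "(" rather than by backtracking as the alternative operator would.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

struct Node {
    std::string name;
    std::vector<Node> operands;
    bool commandFlag = false;
};

// Hand-rolled stand-in for the rules above; all names are hypothetical.
class TreeParser {
public:
    explicit TreeParser(const std::string& text) : text_(text), pos_(0) {}

    // Succeeds only if the whole input is exactly one instruction.
    bool parse(Node& out) {
        if (!instruction(out)) return false;
        skipSpace();
        return pos_ == text_.size();
    }

private:
    void skipSpace() {
        while (pos_ < text_.size() &&
               std::isspace(static_cast<unsigned char>(text_[pos_]))) {
            ++pos_;
        }
    }

    // Matches one literal character after optional whitespace.
    bool literal(char c) {
        skipSpace();
        if (pos_ < text_.size() && text_[pos_] == c) {
            ++pos_;
            return true;
        }
        return false;
    }

    static bool isHead(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
    }
    static bool isTail(char c) { return isHead(c) || (c >= '0' && c <= '9'); }

    // identifier: [a-zA-Z_][a-zA-Z_0-9]*
    bool identifier(std::string& out) {
        skipSpace();
        if (pos_ >= text_.size() || !isHead(text_[pos_])) return false;
        const std::size_t start = pos_++;
        while (pos_ < text_.size() && isTail(text_[pos_])) ++pos_;
        out = text_.substr(start, pos_ - start);
        return true;
    }

    // instruction: command | tag
    bool instruction(Node& out) {
        if (!identifier(out.name)) return false;
        if (!literal('(')) {           // no "(": a plain tag
            out.commandFlag = false;
            return true;
        }
        out.commandFlag = true;
        // -(instruction % ","): either nothing, or one instruction
        // followed by *("," >> instruction).
        const std::size_t save = pos_;
        Node first;
        if (instruction(first)) {
            out.operands.push_back(first);
            while (literal(',')) {
                Node next;
                if (!instruction(next)) return false;  // "," needs an item
                out.operands.push_back(next);
            }
        } else {
            pos_ = save;               // the optional list matched nothing
        }
        return literal(')');
    }

    std::string text_;
    std::size_t pos_;
};
```

With this sketch, TreeParser("F( A() , B( GREAT( SOME , NOT ) ) , C( YES ) )").parse(root) succeeds and leaves three operands under F, while an input such as F(A,) is rejected because a "," must be followed by another instruction.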

The instruction rule is fine; I'll just mention that it is equivalent to instruction %= (command|tag).

One last thing, if there actually is no distinction between A() and SOME (i.e. your original structure with no commandFlag), you can write this parser using only autorules:

template< typename Iterator , typename ExpressionAST >
struct InputGrammar : qi::grammar<Iterator, ExpressionAST(), ascii::space_type> {
   InputGrammar() : InputGrammar::base_type( command ) {
      identifier %=
             qi::char_("a-zA-Z_")
         >> *qi::char_("a-zA-Z_0-9");
      command %=
            identifier
         >> -(
            "("
         >> -(command % ",")
         >>  ")");
    }
    qi::rule< Iterator , std::string(void) , ascii::space_type > identifier;
    qi::rule< Iterator , ExpressionAST(void) , ascii::space_type > command;
};

This is the big benefit of using a fusion wrapped structure that models the input closely.
