3
votes

What is the proper way of handling C++ code blocks in Xtext/ANTLR?

We are writing an Xtext-based Eclipse plugin for a DSL that supports adding C++ function-level code within well-defined scopes (mostly serial { /* ... */ } blocks) such as this:

module m {
  chare c {
    entry void foo() {
      serial {
        // C++ code block
      }
    }
  }
}

See here for a more comprehensive example. This is then handed over to an external tool to handle further compilation/linking steps, so we don't generate any code from Eclipse.

The issue here is how to handle these C++ code blocks, especially given that they may contain braces of their own. This is very similar to "How to include Java Code Block in Xtext DSL?", but for now we are content with just ignoring that block (i.e. having no content assist or syntax highlighting there is not ideal but acceptable).

In our bison/flex-based tool this is done by sharing a variable between the parser and lexer that toggles a "C++ parsing mode" within certain grammar rules. In that mode the lexer returns a CPROGRAM token for everything except the relevant delimiters (e.g. braces). The natural translation seems to be a custom ANTLR lexer that uses semantic predicates to the same effect, e.g.

RULE_NON_BRACES: {in_braces}? ~('{' | '}')+;

as the first lexing rule, but I cannot find how to access that global variable from the Xtext grammar since there doesn't seem to be a concept of "rule action" as in bison. There are other non-"serial" grammar contexts where C++ code is expected, so there needs to be some coordination between the parser and lexer.


2 Answers

3
votes

Your question seems more focused on how the DSL lexer avoids getting lost in C++ code. The basic answer is you need to match braces (i.e., ensure that they nest properly).

I don't know how you define an Xtext/ANTLR lexical rule to do that; I presume there is an ugly way to drop down into procedural code and start reading characters one-by-one. This may have some complications; your brace-matching logic may have to worry about various types of quoting in the C++ code. For instance,

        {   //   this } isn't a match

and

        {   char x[]="} this isnt a match { either"; }

Other C++ string quotes may make this even more difficult to see. What will you do about a C++ macro used like this?

        {
        #define rcb }
            {   rcb
        }

You will probably have to make some special rules about how } is processed in the embedded C++ code, and your character-by-character scanning will have to know these rules.
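To make the character-by-character idea concrete, here is a minimal sketch in Java (the language an Xtext lexer would be written in). Everything in it is an assumption for illustration: the class name BraceScanner and the method findClosingBrace are hypothetical, nothing here is an Xtext or ANTLR API, and it is not wired into any lexer. It skips comments, string literals, and character literals, and it does not attempt macros, raw string literals, or line continuations.

```java
// Hypothetical helper for illustration only; not part of Xtext or ANTLR.
final class BraceScanner {
    private BraceScanner() {}

    /**
     * Returns the index of the '}' that closes the '{' at position open,
     * or -1 if the block is unterminated. Braces inside line comments,
     * block comments, string literals, and character literals are ignored.
     * Not handled: macros, raw string literals, line continuations.
     */
    static int findClosingBrace(String src, int open) {
        if (open < 0 || open >= src.length() || src.charAt(open) != '{') {
            throw new IllegalArgumentException("no '{' at index " + open);
        }
        int n = src.length();
        int depth = 0;
        int i = open;
        while (i < n) {
            char c = src.charAt(i);
            if (c == '/' && i + 1 < n && src.charAt(i + 1) == '/') {
                // Line comment: skip to just past the end of the line.
                int eol = src.indexOf('\n', i);
                if (eol < 0) return -1;
                i = eol + 1;
            } else if (c == '/' && i + 1 < n && src.charAt(i + 1) == '*') {
                // Block comment: skip to just past the terminator.
                int end = src.indexOf("*/", i + 2);
                if (end < 0) return -1;
                i = end + 2;
            } else if (c == '"' || c == '\'') {
                // String or character literal: skip to just past the closing quote.
                i = skipQuoted(src, i);
                if (i < 0) return -1;
            } else {
                if (c == '{') {
                    depth++;
                } else if (c == '}' && --depth == 0) {
                    return i;
                }
                i++;
            }
        }
        return -1;
    }

    /** Returns the index just past the closing quote, or -1 if unterminated. */
    private static int skipQuoted(String src, int start) {
        char quote = src.charAt(start);
        for (int i = start + 1; i < src.length(); i++) {
            char c = src.charAt(i);
            if (c == '\\') {
                i++; // skip the escaped character
            } else if (c == quote) {
                return i + 1;
            }
        }
        return -1;
    }
}
```

This gets the comment and string examples above right, but the #define example still defeats it: the scan stops at the } inside the macro definition. That is the kind of case that needs a special rule.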

Rather than make this complicated, I think what you should do is pick a really unlikely sequence of characters in C++ as your termination, e.g.,

    ][[

which I believe cannot occur in C++ except in a string or comment, or

    }}}

and simply use that. No need to match braces at all. In almost all cases, the C++ to drop in won't have to be touched; in the very rare case where it happens to contain that sequence, a trivial edit (insert a space or linebreak) fixes it. Now your lexer rule is simple and can be expressed (I think) using your standard lexer.

If you go this way, I'd suggest you choose a corresponding opening sequence to introduce the C++ code, just to remind the reader that a funny sequence is required, e.g.,

  serial {{{    <C++code>  }}}

or

  serial ]][   <C++code>   ][[

With this convention, even my ugly macro example is easy:

  serial {{{
      {
      #define rcb }
         {  rcb
      } 
  }}}
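To show how little logic the escape convention needs, here is a minimal sketch in Java. The class name EscapedBlock and the method extract are hypothetical, the {{{ and }}} delimiters are just the ones suggested above, and this is plain string searching rather than an actual Xtext terminal rule.

```java
// Hypothetical helper for illustration only; not part of Xtext or ANTLR.
final class EscapedBlock {
    static final String OPEN = "{{{";
    static final String CLOSE = "}}}";

    private EscapedBlock() {}

    /**
     * Returns the text between the first OPEN at or after index from and
     * the next CLOSE that follows it, or null if either delimiter is missing.
     * The body is never inspected, so nothing in it needs to balance.
     */
    static String extract(String src, int from) {
        int start = src.indexOf(OPEN, from);
        if (start < 0) return null;
        int bodyStart = start + OPEN.length();
        int end = src.indexOf(CLOSE, bodyStart);
        if (end < 0) return null;
        return src.substring(bodyStart, end);
    }
}
```

Applied to the macro example, this returns the whole C++ body including the #define line. One caveat of this particular sketch: it stops at the earliest }}} it sees, so a body whose last character is } needs whitespace before the terminator.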

PS: This funny notational trick is called a "domain (notation) escape". This problem occurs in every system (yes not that many in the wild, but I have one :) that allows one to mix multiple notations. The sequence varies across language/system according to taste.

1
votes

If you really cannot change the syntax and need to rely on matching curly braces, then you need to reimplement your flex-based solution in Java (e.g. using JFlex) and make Xtext use that lexer. I have covered that briefly in this blog post. It also contains a pointer to example code where I have used a JFlex-based lexer in Xtext.