How should I construct and walk the AST output of an ANTLR3 grammar?

Question

Documentation and general advice is that the abstract syntax tree should omit tokens that have no meaning. ("Record the meaningful input tokens (and only the meaningful tokens" - The Definitive ANTLR Reference) IE: In a C++ AST, you would omit the braces at the start and end of the class, as they have no meaning and are simply a mechanism for delineating class start and end for parsing purposes. I understand that, for the sake of rapidly and efficiently walking the tree, culling out useless token nodes like that is useful, but in order to appropriately colorize code, I would need that information, even if it doesn't contribute to the meaning of the code. A) Is there any reason why I shouldn't have the AST serve multiple purposes and choose not to omit said tokens?

It seems to me that what ANTLRWorks interpreter outputs is what I'm looking for. In the ANTLRWorks interpreter, it outputs a tree diagram where, for each rule matched, a node is made, as well as a child node for each token and/or subrule. The parse tree, I suppose it's called.

If manually walking a tree, wouldn't it be more useful to have nodes marking a rule? By having a node marking a rule, with it's subrules and tokens as children, a manual walker doesn't need to look ahead several nodes to know the context of what node it's on. Tree grammars seem redundant to me. Given a tree of AST nodes, the tree grammar "parses" the nodes all over again in order to produce some other output. B) Given that the parser grammar was responsible for generating correctly formed ASTs and given the inclusion of rule AST nodes, shouldn't a manual walker avoid the redundant AST node pattern matching of a tree grammar?

I fear I'm wildly misunderstanding the tree grammar mechanism's purpose. A tree grammar more or less defines a set of methods that will run through a tree, look for a pattern of nodes that matches the tree grammar rule, and execute some action based on that. I can't depend on forming my AST output based on what's neat and tidy for the tree grammar (omitting meaningless tokens for pattern matching speed) yet use the AST for color-coding even the meaningless tokens. I'm writing an IDE as well; I also can't write every possible AST node pattern that a plugin author might want to match, nor do I want to require them using ANTLR to write a tree grammar. In the case of plugin authors walking the tree for their own criteria, rule nodes would be rather useful to avoid needing pattern matching.

Thoughts? I know this "question" might push the limits of being an SO question, but I'm not sure how else to formulate my inquiries or where else to inquire.

Good question. Note that I took the liberty to change the tag antlr3 to antlr since your question is not just applicable to the 3rd version of ANTLR. I would have just added the tag, but there can only be a total of 5 tags added to a question. — Bart Kiers
@Sion: You had asked a question earlier about having a class B { std::vector<B> x; }; or some similar situation involving a std::vector of an incomplete type. You've deleted that question, saying that it works. It doesn't really work: you cannot instantiate a Standard Library container with an incomplete type. It may compile but the results are undefined. — James McNellis

Bart Kiers Bart Kiers · Accepted Answer · 2011-05-13T08:33:23

Sion Sheevok wrote:

A) Is there any reason why I shouldn't have the AST serve multiple purposes and choose not to omit said tokens?

No, you mind as well keep them in there.

Sion Sheevok wrote:

It seems to me that what ANTLRWorks interpreter outputs is what I'm looking for. In the ANTLRWorks interpreter, it outputs a tree diagram where, for each rule matched, a node is made, as well as a child node for each token and/or subrule. The parse tree, I suppose it's called.

Correct.

Sion Sheevok wrote:

B) Given that the parser grammar was responsible for generating correctly formed ASTs and given the inclusion of rule AST nodes, shouldn't a manual walker avoid the redundant AST node pattern matching of a tree grammar?

Tree grammars are often used to mix custom code in to evaluate/interpret the input source. If you mix this code inside the parser grammar and there's some backtracking going on in the parser, this custom code could be executed more than it's supposed to. Walking the tree using a tree grammar is (if done properly) only possible in one way causing custom code to be executed just once.

But if a separate tree walker/iterator is necessary, there are two camps out there that advocate using tree grammars and others opt for walking the tree manually with a custom iterator. Both camps raise valid points about their preferred way of walking the AST. So there's no clear-cut way to do it in one specific way.

Sion Sheevok wrote:

Thoughts?

Since you're not evaluating/interpreting, you mind as well not use a tree grammar.

But to create a parse tree as ANTLRWorks does (which you have no access to, btw), you will need to mix AST rewrite rules inside your parser grammar though. Here's a Q&A that explains how to do that: How to output the AST built using ANTLR?

Good luck!

How should I construct and walk the AST output of an ANTLR3 grammar?

1 Answers