LR(1) parse straight to Abstract syntax tree

Question

So I asked a question close this recently and received an very good answer. However the steps that were described seemed more like the steps to create a concrete syntax tree.

Each reduction in the LR parsing process corresponds to an internal node in the parse tree. The rule being reduced is the internal AST node, and the items popped off the stack correspond to the children of that internal node. The item pushed for the goto corresponds to the internal node, while those pushed by shift actions correspond to leaves (tokens) of the AST.

Putting all that together, you can easily build an AST by createing a new internal node every time you do a reduction and wiring everything together appropriately. ~Chris Dodd

I know that the parse tree is implied by the steps taken however I don't want to explicitly build the parse tree. And generating a node every reduction seems like it would result in the entire parse tree. I have considered that I only build a node if a certain state is seen. However this seems like it would not work properly.

rici rici · Accepted Answer · 2015-05-27T02:57:09

You don't need to build a node on every reduction, and the nodes you do build don't need to include every symbol being reduced. Nor do the reduced symbols need to appear in the same order as the parse.

In many cases, an AST is a simplification of the full parse tree, corresponding to the above.

Simple example, for an expression grammar, using a yacc-like parser generator:

expr: term            { $$ = $1; /* see below */ }
    | expr '+' term   { $$ = new_sum_node($1, $3); }
term: factor          { $$ = $1; /* see below */ }
    | term '*' factor { $$ = new_product_node($1, $3); }
factor: '(' expr ')'  { $$ = $2; /* See below */ }
      | ID            { $$ = new_variable_node($1); }
      | NUMBER        { $$ = new_literal_node($1); }

The AST is built as the semantic value of the non-terminals. The functions new_*_node are expected to return a newly-allocated node of the indicated type. (Here we assume that there is some mechanism to deduce from a pointer what type of node it is. For example, it would be possible to use subtyping and virtual functions.)

In Yacc (and similar parser generators), every symbol (terminal or non-terminal) has an associated "value", and every production has an associated action which executes when the production is reduced. The action for a production can assign the "value" of the non-terminal being reduced. The action can make use of the "values" of the components of the right-hand-side, since each one is either a terminal (whose value was set by the scanner) or a non-terminal which has already been reduced. In effect, this is an S-attributed grammar.

Some of the above reductions do not appear at all in the AST. In particular, the unit reductions (expr:term and term:factor) simply pass through the AST for the right-hand side. Similarly, the parenthesis reduction term:'(' expr ')' simply passes through the AST for the expr, with the result that the parentheses effectively vanish from the AST. (That's not correct in all languages; in some languages, apparently-redundant parentheses actually affect the semantics and you'd need to create an AST node to record the fact.)

In Yacc, $$ = $1 is the default reduction action if no action is specified, and since most unit reductions are eliminated from the AST, they will normally be shown without reduction actions. I made them explicit for didactic purposes; in practice, you shouldn't do that.

LR(1) parse straight to Abstract syntax tree

1 Answers