You don't need to build a node on every reduction, and the nodes you do build don't need to include every symbol being reduced. Nor do the reduced symbols need to appear in the same order as the parse.
In many cases, an AST is a simplification of the full parse tree, corresponding to the above.
Simple example, for an expression grammar, using a yacc-like parser generator:
expr: term { $$ = $1; /* see below */ }
| expr '+' term { $$ = new_sum_node($1, $3); }
term: factor { $$ = $1; /* see below */ }
| term '*' factor { $$ = new_product_node($1, $3); }
factor: '(' expr ')' { $$ = $2; /* See below */ }
| ID { $$ = new_variable_node($1); }
| NUMBER { $$ = new_literal_node($1); }
The AST is built as the semantic value of the non-terminals. The functions new_*_node
are expected to return a newly-allocated node of the indicated type. (Here we assume that there is some mechanism to deduce from a pointer what type of node it is. For example, it would be possible to use subtyping and virtual functions.)
In Yacc (and similar parser generators), every symbol (terminal or non-terminal) has an associated "value", and every production has an associated action which executes when the production is reduced. The action for a production can assign the "value" of the non-terminal being reduced. The action can make use of the "values" of the components of the right-hand-side, since each one is either a terminal (whose value was set by the scanner) or a non-terminal which has already been reduced. In effect, this is an S-attributed grammar.
Some of the above reductions do not appear at all in the AST. In particular, the unit reductions (expr:term
and term:factor
) simply pass through the AST for the right-hand side. Similarly, the parenthesis reduction term:'(' expr ')'
simply passes through the AST for the expr
, with the result that the parentheses effectively vanish from the AST. (That's not correct in all languages; in some languages, apparently-redundant parentheses actually affect the semantics and you'd need to create an AST node to record the fact.)
In Yacc, $$ = $1
is the default reduction action if no action is specified, and since most unit reductions are eliminated from the AST, they will normally be shown without reduction actions. I made them explicit for didactic purposes; in practice, you shouldn't do that.