What are the points to consider when building an Abstract Syntax Tree

Question

I am working on a simple toy language for describing UI widgets. I have used bison/flex to implement the grammar. I would now like to create an AST from the grammar rules. However, I am not sure of the "level of granularity" of the AST and what should be included in it. From what I have read, ASTs should be "minimal" and avoid providing redundant information. At the same time obviously I should not loose any information from the original source. Are there specific rules to follow when building ASTs?

The grammar is given below:

%{

#include <string>
#include <iostream>

void yyerror (const char *error);
int  yylex();

extern int yylineno;

%}

%code{
 //   int lineno;
}

%union {
  char* s;
  double d;
  int i;
}

/* declare tokens */


%token APPLICATION GEOMETRY
%token FORM BUTTON LABEL TEXTINPUT
%token ID NAME 
%token WIDTH HEIGHT TEXT ONCLICK
%token ABSOLUTELAYOUT LINEARLAYOUT GRIDLAYOUT
%token ABSOLUTE_LAYOUT_DOT_LEFT ABSOLUTE_LAYOUT_DOT_TOP   
%token EOL ENDOFFILE

%token<s> IDENTIFIER STRING
%token<d> INTEGER

%%


appDef: APPLICATION '{' NAME ':' STRING formdefList '}' {std::cout << "found app" << std::endl;}

iddef : ID ':' IDENTIFIER


formdefList : /* nothing */
    | formdefList formdef
    ;


formdef : FORM  '{' formbodydef '}' 
    ;

formbodydef : /*nothing*/
    | iddef
    | formbodydef layoutdef
    | error                     { 
                                  std::cout << "found error in form body near line: " << yylineno << std::endl;
                                  std::exit(1);  
                                }
    ;

layoutdef : ABSOLUTELAYOUT '{' layoutBody '}' 
    | GRIDLAYOUT '{' layoutBody '}'
    | LINEARLAYOUT '{' layoutBody '}'
    ;

layoutBody : /* nothing */
    | layoutBody layoutdef       /* Layouts can be embedded in one another */
    | layoutBody buttonDef
    | layoutBody labelDef
    | error { std::cout << "Was expecting a child control near line: " << yylineno << std::endl; std::exit(1);}
    ;

geometrydef : GEOMETRY ':' '{' geometrybody '}' { std::cout << "start of geometry def:" << std::endl;}
    ;

geometrybody : /*nothing*/                      { std::cout << "start of geometrybody def" << std::endl;}
    | geometrybody WIDTH ':' INTEGER             
    | geometrybody HEIGHT ':' INTEGER
    | geometrybody ABSOLUTE_LAYOUT_DOT_LEFT ':' INTEGER
    | geometrybody ABSOLUTE_LAYOUT_DOT_TOP ':' INTEGER
    | error { std::cout << "error near line: " << yylineno << std::endl; std::exit(1);}
    ;

buttonDef : BUTTON '{' buttonBody '}'
    ;

buttonBody : /*nothing*/
    | buttonBody iddef
    | buttonBody TEXT ':' STRING
    | buttonBody geometrydef
    ;

labelDef : LABEL '{' labelBody '}'
    ;

labelBody : /*nothing*/
    | labelBody iddef
    | labelBody TEXT ':' STRING
    | labelBody geometrydef
    ;        

%%


void yyerror (const char *error)
{
  std::cout << error << std::endl;
}

Here is some sample input for validating the grammar:

Application{
name: "HelloWorld"
   Form{
      id: MainForm
      LinearLayout{
         Button{
           geometry: {width:20 height:20}
         }
         AbsoluteLayout{
            Button{
               geometry:{
                  width:20
                  height:20
                  AbsoluteLayout.Top:10
                  AbsoluteLayout.Left:20
               }
            }
         }
      }
   }
   Form{
      id: Secondary
   }
}

I am now in the process of building an AST for the grammar and need some advice on its structure:

How should I map the grammar rules to AST nodes? One AST node per terminal symbol?
- How should I design the parent-child relationships between nodes?
- Do the left hand side of grammar rules such as formdef, formbodydef, formdefList explicitly appear in the ASTs?
- What should I include in the nodes? Is it enough to simply include the terminal token type (ID, NAME, etc...) and token value?
- Should I include all token values as strings and perform the actual type conversions such as strings to integers, etc... at a later stage?

Thanks for your advice!

Ira Baxter Ira Baxter · Accepted Answer · 2016-01-29T14:54:43

Nodes need to contain an "abstract" syntax category (something typically close to the LHS of many rules). See https://stackoverflow.com/a/1916687/120163 for a discussion of AST vs CSTs, and why you might want to have nodes contain a concrete syntax category (e.g., exactly the LHS of rules, or the concrete names of leaves). The link discusses when you should keep terminals in your AST.

They should also contain values if they represent something that varies in a way that grammar does not record (e.g, identifier names, numeric and string literal constants, etc.) See https://stackoverflow.com/a/6320259/120163 for why you should convert values when you lex, not later.

Shown here: https://stackoverflow.com/a/17393852/120163 are examples of what ASTs might look like if you print them out (you better build an AST printer to help you debug them).

How much information you need to keep in node depends on what you intend to do with it eventually. If you want to regenerate source from the AST, you need to keep a lot of stuff you might not ever have considered, e.g. radix of converted numbers. See https://stackoverflow.com/a/5834775/120163 for more details.

What are the points to consider when building an Abstract Syntax Tree

1 Answers