1
votes

So I am making an interpreter for a language which I am making which is similar to Python. Now I understand that this is no small task and I don't expect it to work very well or do much but I would like it to have some basic functionality (variables, functions, loops, if statements, etc...).

So currently I am at the stage where the interpreter takes a file, and splits it up into a list of tokens, and now I am ready to turn these tokens into an AST. I intend to do this with a recursive descent parser, which I believe I understand, but here is the problem. Let's say I have the following input

1 + 2 * 3

this would output 7, because using BIDMAS the multiplication is done first so

2 * 3 = 6

then the addition is done after

1 + 6 = 7

I know how to get this order as I have a simple grammar, but I do not know how to store this as an AST. To simplify things for the answers, lets assume this is the only input you will recieve and the grammar can be

program = add
add = mul {"+" mul}
mul = NUM {"*" NUM}

So basically, how do you make a data structure(s) to store an AST?

P.S. I am doing this in C.

2
You should buy the Compilers "dragon" book from Aho Seti and Ullman...and a tool called bison will help you to achieve what you intend in a fraction of the time you spend to do it "by hand"Leo
unfortunately I cannot use bison as the aim of this project is to do all of it "by hand". thanks anywaydangee1705

2 Answers

4
votes

Disclaimer: This representation is subjective and just meant to illuminate.

Fundamentally, your AST will be constructed like a binary tree where each AST node is a "C" structure that holds both a "left" and "right" pointer. The other elements of the AST are typically context sensitive. For example, a variable declaration versus a function definition or an expression in a function. For the example you cited, the rough tree would mirror this:

   +
 /   \
1     *
      /\
     2  3 

So by substituting the above nodes 1 + (2 * 3) with your AST construct would be similar to:

                 -----------------
                | type: ADDOP   |
                | left: struct* |
                | right: struct*|
                -----------------
              /                   \
             /                     \
 (ADDOP left points to)         (ADDOP right points to)
------------------------       --------------------------  
| type: NUMLITERAL     |       | type: MULTOP           |
| value: 1             |       | left: struct*          |
| left: struct* (null) |       | right: struct*         |
| right: struct*(null) |       --------------------------
------------------------              /               \
                                     /                 \

                    (MULTOP left points to)         (MULTOP right points to)
                    ------------------------       --------------------------  
                    | type: NUMLITERAL     |       | type: NUMLITERAL       |
                    | value: 2             |       | value: 3               |
                    | left: struct* (null) |       | left: struct* (null)   |
                    | right: struct*(null) |       | right: struct* (null)  |
                    ------------------------       --------------------------

I assume that you know enough about "C" and how to malloc nodes and assign the left/right pointers.

Now the remaining activity would be to do a post order traversal of the tree to either evaluate the expression and produce a result or to emit the appropriate intermediate code/machine code that aligns to a compiled result. Either choice bringing with it a massive amount of thinking and planning on your part.

BTW: As noted, the AST nodes are going to typically have attributes based on the level of abstraction you want to represent. Also note that a typical compiler may leverage multiple AST for different reasons. Yep, more reading/studying on your part.

Note: This illustrates the data structure for an AST but @mikeb answer is rock solid for how to get the string "1 + 2 * 3" into the nodes of such a structure.

1
votes

I'd use the "Shunting Yard" algorithm -> https://en.wikipedia.org/wiki/Shunting-yard_algorithm

There is psudocode there too.

FTA:

In computer science, the shunting-yard algorithm is a method for parsing mathematical expressions specified in infix notation. It can be used to produce either a postfix notation string, also known as Reverse Polish notation (RPN), or an abstract syntax tree (AST). The algorithm was invented by Edsger Dijkstra and named the "shunting yard" algorithm because its operation resembles that of a railroad shunting yard. Dijkstra first described the Shunting Yard Algorithm in the Mathematisch Centrum report MR 34/61.

Like the evaluation of RPN, the shunting yard algorithm is stack-based. Infix expressions are the form of mathematical notation most people are used to, for instance "3+4" or "3+4*(2−1)". For the conversion there are two text variables (strings), the input and the output. There is also a stack that holds operators not yet added to the output queue. To convert, the program reads each symbol in order and does something based on that symbol. The result for the above examples would be "3 4 +" or "3 4 2 1 - * +".

The shunting-yard algorithm has been later generalized into operator-precedence parsing.

The code, since it was pointed out that this is not how to store it (and if you don't like C - take your pick from http://rosettacode.org/wiki/Parsing/Shunting-yard_algorithm):

#include <sys/types.h>
#include <regex.h>
#include <stdio.h>

typedef struct {
    const char *s;
    int len, prec, assoc;
} str_tok_t;

typedef struct {
    const char * str;
    int assoc, prec;
    regex_t re;
} pat_t;

enum assoc { A_NONE, A_L, A_R };
pat_t pat_eos = {"", A_NONE, 0};

pat_t pat_ops[] = {
    {"^\\)",    A_NONE, -1},
    {"^\\*\\*", A_R, 3},
    {"^\\^",    A_R, 3},
    {"^\\*",    A_L, 2},
    {"^/",      A_L, 2},
    {"^\\+",    A_L, 1},
    {"^-",      A_L, 1},
    {0}
};

pat_t pat_arg[] = {
    {"^[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?"},
    {"^[a-zA-Z_][a-zA-Z_0-9]*"},
    {"^\\(", A_L, -1},
    {0}
};

str_tok_t stack[256]; /* assume these are big enough */
str_tok_t queue[256];
int l_queue, l_stack;
#define qpush(x) queue[l_queue++] = x
#define spush(x) stack[l_stack++] = x
#define spop()   stack[--l_stack]

void display(const char *s)
{
    int i;
    printf("\033[1;1H\033[JText | %s", s);
    printf("\nStack| ");
    for (i = 0; i < l_stack; i++)
        printf("%.*s ", stack[i].len, stack[i].s); // uses C99 format strings
    printf("\nQueue| ");
    for (i = 0; i < l_queue; i++)
        printf("%.*s ", queue[i].len, queue[i].s);
    puts("\n\n<press enter>");
    getchar();
}

int prec_booster;

#define fail(s1, s2) {fprintf(stderr, "[Error %s] %s\n", s1, s2); return 0;}

int init(void)
{
    int i;
    pat_t *p;

    for (i = 0, p = pat_ops; p[i].str; i++)
        if (regcomp(&(p[i].re), p[i].str, REG_NEWLINE|REG_EXTENDED))
            fail("comp", p[i].str);

    for (i = 0, p = pat_arg; p[i].str; i++)
        if (regcomp(&(p[i].re), p[i].str, REG_NEWLINE|REG_EXTENDED))
            fail("comp", p[i].str);

    return 1;
}

pat_t* match(const char *s, pat_t *p, str_tok_t * t, const char **e)
{
    int i;
    regmatch_t m;

    while (*s == ' ') s++;
    *e = s;

    if (!*s) return &pat_eos;

    for (i = 0; p[i].str; i++) {
        if (regexec(&(p[i].re), s, 1, &m, REG_NOTEOL))
            continue;
        t->s = s;
        *e = s + (t->len = m.rm_eo - m.rm_so);
        return p + i;
    }
    return 0;
}

int parse(const char *s) {
    pat_t *p;
    str_tok_t *t, tok;

    prec_booster = l_queue = 0;
    display(s);
    while (*s) {
        p = match(s, pat_arg, &tok, &s);
        if (!p || p == &pat_eos) fail("parse arg", s);

        /* Odd logic here. Don't actually stack the parens: don't need to. */
        if (p->prec == -1) {
            prec_booster += 100;
            continue;
        }
        qpush(tok);
        display(s);

re_op:      p = match(s, pat_ops, &tok, &s);
        if (!p) fail("parse op", s);

        tok.assoc = p->assoc;
        tok.prec = p->prec;

        if (p->prec > 0)
            tok.prec = p->prec + prec_booster;
        else if (p->prec == -1) {
            if (prec_booster < 100)
                fail("unmatched )", s);
            tok.prec = prec_booster;
        }

        while (l_stack) {
            t = stack + l_stack - 1;
            if (!(t->prec == tok.prec && t->assoc == A_L)
                    && t->prec <= tok.prec)
                break;
            qpush(spop());
            display(s);
        }

        if (p->prec == -1) {
            prec_booster -= 100;
            goto re_op;
        }

        if (!p->prec) {
            display(s);
            if (prec_booster)
                fail("unmatched (", s);
            return 1;
        }

        spush(tok);
        display(s);
    }

    return 1;
}

int main()
{
    int i;
    const char *tests[] = { 
        "3 + 4 * 2 / ( 1 - 5 ) ^ 2 ^ 3",    /* RC mandated: OK */
        "123",                  /* OK */
        "3+4 * 2 / ( 1 - 5 ) ^ 2 ^ 3.14",   /* OK */
        "(((((((1+2+3**(4 + 5))))))",       /* bad parens */
        "a^(b + c/d * .1e5)!",          /* unknown op */
        "(1**2)**3",                /* OK */
        0
    };

    if (!init()) return 1;
    for (i = 0; tests[i]; i++) {
        printf("Testing string `%s'   <enter>\n", tests[i]);
        getchar();

        printf("string `%s': %s\n\n", tests[i],
            parse(tests[i]) ? "Ok" : "Error");
    }

    return 0;
}