1
votes

Im trying to parse cpp using python. I generated the parser with ANTLR for python and now I want to visit the tree and gather some information.

  • Is there anyway to dump the ANTLR tree as AST in JSON format?
  • I was trying to trace the function calls I was expecting something like CallExpr but I couldn't find anything in generated parser files.

This is the grammar file im using https://github.com/antlr/grammars-v4/blob/master/cpp/CPP14.g4

I tried the following command to get the CPP parser, java -jar antlr-4.8-complete.jar -Dlanguage=Python3 ./CPP14.g4 -visitor

and this is the very basic code i have

import sys
import os
from antlr4 import *
from CPP14Lexer import *
from CPP14Parser import *
from CPP14Visitor import *



class TREEVisitor(CPP14Visitor):
    def __init__(self):
        pass


    def visitExpressionstatement(self, ctx):
        print(ctx.getText())
        return self.visitChildren(ctx)



if __name__ == '__main__':
    dtype = ""
    input_stream = FileStream(sys.argv[1])
    cpplex = CPP14Lexer(input_stream)
    commtokstream = CommonTokenStream(cpplex)
    cpparser = CPP14Parser(commtokstream)
    print("parse errors: {}".format(cpparser._syntaxErrors))

    tree = cpparser.translationunit()

    tv = TREEVisitor()
    tv.visit(tree)

and the input file im trying to parse,

#include <iostream>

using namespace std;


int foo(int i, int i2)
{
    return i * i2;
}

int main(int argc, char *argv[])
{
    cout << "test" << endl;
    foo(1, 3);
    return 0;
}

Thanks

1
Can you add a link to the grammar you're using? Can you also update your question and post the code you're using to parse the CPP source file? (as well as the contents of this source file)Bart Kiers
In the grammar, I don't see a rule called CallExpr: why are you expecting it? What is the input you're parsing?Bart Kiers
The question is how can I visit the function calls like foo(1, 2, 3); I thought maybe that is defined under another name in that grammar file!Alex
remove def __init__(self): passeyllanesc
@eyllanesc that might be good advice, but it has nothing to do with the problem at hand. And when giving advice, it would be nice to explain why, or provide a link that explains this.Bart Kiers

1 Answers

2
votes

Function calls are recognised by the postfixexpression rule:

postfixexpression
   : primaryexpression
   | postfixexpression '[' expression ']'
   | postfixexpression '[' bracedinitlist ']'
   | postfixexpression '(' expressionlist? ')'   // <---- this alternative!
   | simpletypespecifier '(' expressionlist? ')'
   | typenamespecifier '(' expressionlist? ')'
   | simpletypespecifier bracedinitlist
   | typenamespecifier bracedinitlist
   | postfixexpression '.' Template? idexpression
   | postfixexpression '->' Template? idexpression
   | postfixexpression '.' pseudodestructorname
   | postfixexpression '->' pseudodestructorname
   | postfixexpression '++'
   | postfixexpression '--'
   | Dynamic_cast '<' thetypeid '>' '(' expression ')'
   | Static_cast '<' thetypeid '>' '(' expression ')'
   | Reinterpret_cast '<' thetypeid '>' '(' expression ')'
   | Const_cast '<' thetypeid '>' '(' expression ')'
   | typeidofthetypeid '(' expression ')'
   | typeidofthetypeid '(' thetypeid ')'
   ;

So if you add this to your visitor:

def visitPostfixexpression(self, ctx:CPP14Parser.PostfixexpressionContext):
    print(ctx.getText())
    return self.visitChildren(ctx)

It will get printed. Note that it will now print a lot more than function calls, since it matches much more than that. You could label the alternatives:

postfixexpression
   : primaryexpression                                     #otherPostfixexpression
   | postfixexpression '[' expression ']'                  #otherPostfixexpression
   | postfixexpression '[' bracedinitlist ']'              #otherPostfixexpression
   | postfixexpression '(' expressionlist? ')'             #functionCallPostfixexpression
   | simpletypespecifier '(' expressionlist? ')'           #otherPostfixexpression
   | typenamespecifier '(' expressionlist? ')'             #otherPostfixexpression
   | simpletypespecifier bracedinitlist                    #otherPostfixexpression
   | typenamespecifier bracedinitlist                      #otherPostfixexpression
   | postfixexpression '.' Template? idexpression          #otherPostfixexpression
   | postfixexpression '->' Template? idexpression         #otherPostfixexpression
   | postfixexpression '.' pseudodestructorname            #otherPostfixexpression
   | postfixexpression '->' pseudodestructorname           #otherPostfixexpression
   | postfixexpression '++'                                #otherPostfixexpression
   | postfixexpression '--'                                #otherPostfixexpression
   | Dynamic_cast '<' thetypeid '>' '(' expression ')'     #otherPostfixexpression
   | Static_cast '<' thetypeid '>' '(' expression ')'      #otherPostfixexpression
   | Reinterpret_cast '<' thetypeid '>' '(' expression ')' #otherPostfixexpression
   | Const_cast '<' thetypeid '>' '(' expression ')'       #otherPostfixexpression
   | typeidofthetypeid '(' expression ')'                  #otherPostfixexpression
   | typeidofthetypeid '(' thetypeid ')'                   #otherPostfixexpression
   ;

and you can then do:

def visitFunctionCallPostfixexpression(self, ctx:CPP14Parser.FunctionCallPostfixexpressionContext):
    print(ctx.getText())
    return self.visitChildren(ctx)

and then only foo(1,3) gets printed (note that you might want to label more rules as functionCallPostfixexpression inside the postfixexpression rule).

Is there anyway to dump the ANTLR tree as AST in JSON format?

No.

But you could easily create something yourself of course: the objects returned by each parser rule, like translationunit, contains the entire tree. A quick and dirty example:

import antlr4
from antlr4.tree.Tree import TerminalNodeImpl
import json

# import CPP14Lexer, CPP14Parser, ...


def to_dict(root):
    obj = {}
    _fill(obj, root)
    return obj


def _fill(obj, node):

    if isinstance(node, TerminalNodeImpl):
        obj["type"] = node.symbol.type
        obj["text"] = node.getText()
        return

    class_name = type(node).__name__.replace('Context', '')
    rule_name = '{}{}'.format(class_name[0].lower(), class_name[1:])
    arr = []
    obj[rule_name] = arr

    for child_node in node.children:
        child_obj = {}
        arr.append(child_obj)
        _fill(child_obj, child_node)


if __name__ == '__main__':
    source = """
        #include <iostream>

        using namespace std;

        int foo(int i, int i2)
        {
            return i * i2;
        }

        int main(int argc, char *argv[])
        {
            cout << "test" << endl;
            foo(1, 3);
            return 0;
        }
        """
    lexer = CPP14Lexer(antlr4.InputStream(source))
    parser = CPP14Parser(antlr4.CommonTokenStream(lexer))
    tree = parser.translationunit()
    tree_dict = to_dict(tree)
    json_str = json.dumps(tree_dict, indent=2)
    print(json_str)