0
votes

I'm trying to parse C++ code using the ANTLR grammar at https://github.com/antlr/grammars-v4/tree/master/cpp. I want to get the source code of each function, so I decided to override visitFunctionBody, as shown below:

#include <fstream>   // std::ifstream, used in main()
#include <iostream>
#include <antlr4-runtime.h>

#include "parser/CPP14Lexer.h"
#include "parser/CPP14BaseVisitor.h"
#include "parser/CPP14Parser.h"
#include "parser/CPP14Visitor.h"


class TREEVisitor : public CPP14BaseVisitor {
    public:
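        // No "TREEVisitor::" qualifier here: GCC and Clang reject extra
        // qualification on a member declared inside its own class.
        antlrcpp::Any visitFunctionBody(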
            CPP14Parser::FunctionBodyContext *ctx) override
        {
            std::cout << ctx->getText() << std::endl;
            return visitChildren(ctx);
        }
};


int main(int argc, char *argv[]) {

    std::ifstream stream;
    stream.open(argv[1]);
    antlr4::ANTLRInputStream input(stream);
    CPP14Lexer lexer(&input);
    antlr4::CommonTokenStream tokens(&lexer);
    CPP14Parser parser(&tokens);
    antlr4::tree::ParseTree *tree = parser.translationunit();

    // Visitor
    TREEVisitor visitor;  // stack object, so nothing to delete
    visitor.visit(tree);

    return 0;
}

I tried to parse this very basic C++ code:

void foo()
{
    char buf[10];
    int i = 10;
    int b = i * 2;
    return b * i;
}

The output of my ANTLR visitor is the code of the sample function body, but without any newlines or indentation:

{charbuf[10];inti=10;intb=i*2;returnb*i;}

How can I get the source code of the function that I'm parsing exactly as it appears in the source file?

In my use case I parse a big C++ file and I want to match the result of my parsing with the actual source code.

Thanks

2 Answers

1
votes

Here is one way, but there are others. In CPP14Lexer.g4, change "-> skip" to "-> channel(HIDDEN)". Then, in visitFunctionBody(), replace the call to getText() with myGetText(ctx), and define myGetText() along the following lines. This code is in Java, so it needs to be ported to C++; "tokens" is the token stream that was passed to the parser.

public String myGetText(ParseTree node) {
    if (node.getChildCount() == 0) {
        Token t = ((TerminalNodeImpl)node).getSymbol();
        List<Token> tokensBefore = tokens.getHiddenTokensToLeft(t.getTokenIndex(), Token.HIDDEN_CHANNEL);
        String pre = "";
        if (tokensBefore != null) {
            StringBuilder builder2 = new StringBuilder();
            for (Token token : tokensBefore) {
                CharStream input = token.getInputStream();
                String s = input.getText(Interval.of(token.getStartIndex(),token.getStopIndex()));
                builder2.append(s);
            }
            pre = builder2.toString();
        }
        String s2 = node.getText();
        String ss = pre + s2;
        return ss;
    }
    StringBuilder builder = new StringBuilder();
    for (int i = 0; i < node.getChildCount(); i++) {
        String s = myGetText(node.getChild(i));
        builder.append(s);
    }
    return builder.toString();
}
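
Since the question uses the C++ target, here is a rough C++ sketch of the same recursion. To stay self-contained it does not include the ANTLR runtime: Tok, Node, and hiddenToLeft are hypothetical stand-ins for the runtime's token, parse-tree node, and getHiddenTokensToLeft(), so treat it as an illustration of the control flow rather than a drop-in port, and check the real signatures in your runtime version.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a lexer token: its text and whether it was
// routed to the hidden channel (whitespace, comments).
struct Tok {
    std::string text;
    bool hidden;
};

// Hypothetical stand-in for a parse-tree node. A leaf refers to one token
// by index; an inner node only holds children.
struct Node {
    std::size_t tokenIndex = 0;
    std::vector<Node> children;
};

// Text of the contiguous run of hidden tokens immediately left of `index`.
std::string hiddenToLeft(const std::vector<Tok> &tokens, std::size_t index) {
    std::size_t first = index;
    while (first > 0 && tokens[first - 1].hidden) --first;
    std::string out;
    for (std::size_t i = first; i < index; ++i) out += tokens[i].text;
    return out;
}

// Leaf: hidden text to the left, then the token's own text.
// Inner node: concatenation of the children, in order.
std::string myGetText(const Node &node, const std::vector<Tok> &tokens) {
    if (node.children.empty())
        return hiddenToLeft(tokens, node.tokenIndex) + tokens[node.tokenIndex].text;
    std::string out;
    for (const Node &child : node.children) out += myGetText(child, tokens);
    return out;
}
```

As in the Java version, whitespace that follows the last token of the node is not included, because each leaf only looks to its left.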

You can also reconstruct the text without changing "skip" to "HIDDEN" by querying the char stream directly for all the characters between the first and last token of the node.

static void Reconstruct(ParseTree node, Parser parser)
{
    var ct = (ParserRuleContext)node;
    Token ta = ct.getStart();
    Token tb = ct.getStop();
    var input_stream = ta.getInputStream();
    var start = ta.getStartIndex();
    var stop = tb.getStopIndex();
    System.out.println(input_stream.getText(new Interval(start, stop)));
}
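
The idea behind this second approach can be sketched in C++ without the ANTLR runtime. TokSpan and reconstruct are hypothetical names; TokSpan only models the inclusive start/stop character offsets a token carries. With the real C++ runtime you would read those offsets from the context's start and stop tokens and ask the token's input stream for that interval. Slicing a std::string directly, as done here, is an assumption that only holds if the offsets count bytes, for example with ASCII-only input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for a token: inclusive character offsets into the
// original input, like getStartIndex()/getStopIndex() in the code above.
struct TokSpan {
    std::size_t startIndex;
    std::size_t stopIndex;
};

// Return the untouched source text covered by a rule, given the rule's
// first and last token. Whitespace and comments in between survive because
// the slice is taken from the input itself, not from the token texts.
std::string reconstruct(const std::string &source,
                        const TokSpan &first, const TokSpan &last) {
    return source.substr(first.startIndex,
                         last.stopIndex - first.startIndex + 1);
}
```

Because only the offsets of two tokens are needed, this works even when whitespace is skipped by the lexer and never becomes a token.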

0
votes

You just need to keep the whitespace tokens (shove them off to the HIDDEN channel).

Then you'll need access to your TokenStream in your Listener/Visitor. You can then create an Interval from your context's start and stop token indices and call getText(interval) on the token stream. This will include tokens on the HIDDEN channel.

Example:

In your lexer, change -> skip to -> channel(HIDDEN):

Whitespace: [ \t]+ -> channel(HIDDEN);

Newline: ('\r' '\n'? | '\n') -> channel(HIDDEN);

BlockComment: '/*' .*? '*/' -> channel(HIDDEN);

LineComment: '//' ~ [\r\n]* -> channel(HIDDEN);

After parsing your input, pass your TokenStream to your Listener.

...
  CommonTokenStream tokens = new CommonTokenStream(lexer);
...
  ParseTree tree = parser.translationUnit(); 
...
  CPPListener listener = new CPPListener(tokens); 
  ParseTreeWalker walker = new ParseTreeWalker();
  walker.walk(listener, tree);

Then in your listener:

class CPPListener extends CPP14ParserBaseListener {
    TokenStream tokenStream;
    
    CPPListener(TokenStream tokenStream) {
        this.tokenStream = tokenStream;
    }

    @Override
    public void exitFunctionDefinition(CPP14Parser.FunctionDefinitionContext ctx) {
        Interval interval = new Interval(
           ctx.start.getTokenIndex(),
           ctx.stop.getTokenIndex()
        );
        String source = tokenStream.getText(interval);
        System.out.println(source);
    }
}

Output on your example:

void foo()
{
    char buf[10];
    int i = 10;
    int b = i * 2;
    return b * i;
}
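
To illustrate why the channel change matters, here is a small self-contained C++ sketch. It does not use the ANTLR runtime: LexToken, textOfRange, and visibleTextOfRange are hypothetical names that only model a token buffer. Joining every token in a range, hidden ones included, reproduces the source, while joining only the visible tokens gives the squashed text from the question.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a lexer token.
struct LexToken {
    std::string text;
    bool hidden;  // true when the lexer rule used -> channel(HIDDEN)
};

// Join every token in [first, last], hidden ones included. This models
// asking the token stream for the text of a token-index interval.
std::string textOfRange(const std::vector<LexToken> &tokens,
                        std::size_t first, std::size_t last) {
    std::string out;
    for (std::size_t i = first; i <= last && i < tokens.size(); ++i)
        out += tokens[i].text;
    return out;
}

// Join only the visible tokens in [first, last]. This models the
// concatenated leaf text that produced the output in the question.
std::string visibleTextOfRange(const std::vector<LexToken> &tokens,
                               std::size_t first, std::size_t last) {
    std::string out;
    for (std::size_t i = first; i <= last && i < tokens.size(); ++i)
        if (!tokens[i].hidden) out += tokens[i].text;
    return out;
}
```

With "-> skip" the whitespace tokens are never created, so there is nothing for the first function to join; that is why the lexer change comes first.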