Counting tokens using ANTLR4

Question

I need to write a Java program using ANTLR4 that, given a source file containing a single method, counts the number of variables, operators, punctuation symbols and reserved words.

How can I use ANTLR4 to count tokens based on their type?

mittelmania · Accepted Answer · 2015-08-15T09:39:46

After doing some research, and based on Özhan Düz's answer, I realized that what I needed requires two techniques:

Operators, reserved words and punctuation marks can be counted with an ANTLR4 lexer, since they can be identified in the source code without any surrounding context.
Variables (and also constants, methods, classes...) must be counted with an ANTLR4 parser, since identifying them requires parsing the code and understanding the context in which each identifier appears.

For anyone who needs to do something similar in the future, here's exactly how I did it:

1) Use the ANTLR command line tool to generate a Lexer, Parser and BaseListener for your language. Instructions can be found on the official ANTLR website. In this example I generated these classes for analyzing the Java language.

2) Create a new Java project. Add JavaLexer.java, JavaListener.java, JavaParser.java and JavaBaseListener.java to your project, and add the ANTLR runtime library to your project's build path.

3) Create a new class extending the JavaBaseListener base class. Take a look at the JavaBaseListener.java file for all the methods you can override. As the walker traverses the parse tree of your source code, each method is invoked when the corresponding event occurs (for example, enterMethodDeclaration() is invoked each time the walker enters a method declaration).

For example, this listener increments a counter each time it finds a method declaration:

// Requires: import java.util.concurrent.atomic.AtomicInteger;
public static final AtomicInteger count = new AtomicInteger();

/**
 * Implementation of the abstract base listener
 */
public static class MyListener extends JavaBaseListener {
    /**
     * Overrides the default callback called whenever the walker has entered a method declaration.
     * This raises the count every time a new method is found
     */
    @Override
    public void enterMethodDeclaration(JavaParser.MethodDeclarationContext ctx) {
        count.incrementAndGet();
    }
}

4) Create a Lexer, a Parser, a ParseTree and a ParseTreeWalker:

Lexer - Runs over your code from start to finish and splits it into "tokens": identifiers, literals, operators, etc. Each token has its text and a type. The list of token types can be found at the beginning of your lexer file (in our case, JavaLexer.java).
Parser - Uses the lexer's output to build a parse tree representing your code. In addition to tokenizing your source code, this lets you determine the context in which each token appears.
ParseTree - Either the parse tree of your entire code or a subtree of it.
ParseTreeWalker - An object that lets you "walk" the tree, which basically means scanning your code hierarchically instead of from start to finish.

Then, finally, instantiate your listener and walk the ParseTree.

For example:

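public static void main(String... args) throws IOException {
    // Path of the source file to analyze, taken from the command line
    String sourceFile = args[0];

    // Note: ANTLRFileStream is deprecated since ANTLR 4.7; CharStreams.fromFileName(sourceFile) replaces it
    JavaLexer lexer = new JavaLexer(new ANTLRFileStream(sourceFile, "UTF-8"));
    JavaParser parser = new JavaParser(new CommonTokenStream(lexer));
    ParseTree tree = parser.compilationUnit();

    ParseTreeWalker walker = new ParseTreeWalker();
    MyListener listener = new MyListener();
    walker.walk(listener, tree);

    // Report how many method declarations the listener saw
    System.out.println("Methods found: " + count.get());
}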

This is the basis. The next steps depend on what you want to achieve, and this brings me back to the difference between using a Lexer and a Parser:

For basic lexical analysis of your code, like identifying operators and reserved words, use the lexer to iterate over the tokens and determine each token's type by calling Token.getType(). The following code counts the reserved words inside a method. It assumes that the reserved words have the lowest token type numbers in JavaLexer.java, with WHILE as the last of them:

private List<Token> tokenizeMethod(String method) {
    JavaLexer lex = new JavaLexer(new ANTLRInputStream(method));
    CommonTokenStream tokStream = new CommonTokenStream(lex);
    tokStream.fill();

    return tokStream.getTokens();
}


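/**
 * Returns the number of reserved words inside the given method, using lexical analysis.
 * Assumes the reserved words occupy the lowest token types of the generated lexer, ending at WHILE.
 * @param method The method text
 */
private int countReservedWords(String method) {
    int count = 0;

    for (Token t : tokenizeMethod(method)) {
        // fill() appends an EOF token whose type is Token.EOF (-1); skip it so it isn't counted
        if (t.getType() != Token.EOF && t.getType() <= JavaLexer.WHILE) {
            count++;
        }
    }

    return count;
}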
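
The same kind of range check can be extended to the other lexical categories from the question (operators and punctuation). Below is a minimal sketch, assuming the generated lexer numbers its token types in contiguous blocks: reserved words first, then literals, then punctuation, then operators. The class name TokenCategoryTally and the four boundary values are hypothetical placeholders, not taken from JavaLexer.java, so check them against the constants at the top of your own lexer file. The sketch works on plain int token types, such as the values returned by Token.getType(), so it compiles without ANTLR on the classpath.

```java
import java.util.EnumMap;
import java.util.Map;

/** Tallies token types into lexical categories using contiguous type ranges. */
final class TokenCategoryTally {

    /** Lexical categories of interest. */
    enum Category { RESERVED_WORD, LITERAL, PUNCTUATION, OPERATOR, OTHER }

    // HYPOTHETICAL inclusive upper bounds of each block of token types.
    // Replace them with the matching constants from your generated lexer.
    static final int LAST_RESERVED_WORD = 50;
    static final int LAST_LITERAL = 56;
    static final int LAST_PUNCTUATION = 65;
    static final int LAST_OPERATOR = 99;

    /** Maps one token type (e.g. the result of Token.getType()) to a category. */
    static Category classify(int tokenType) {
        // EOF has type -1 and user-defined token types start at 1
        if (tokenType < 1) return Category.OTHER;
        if (tokenType <= LAST_RESERVED_WORD) return Category.RESERVED_WORD;
        if (tokenType <= LAST_LITERAL) return Category.LITERAL;
        if (tokenType <= LAST_PUNCTUATION) return Category.PUNCTUATION;
        if (tokenType <= LAST_OPERATOR) return Category.OPERATOR;
        // Everything above the last block: identifiers, comments, etc.
        return Category.OTHER;
    }

    /** Counts how many of the given token types fall into each category. */
    static Map<Category, Integer> tally(int[] tokenTypes) {
        Map<Category, Integer> counts = new EnumMap<>(Category.class);
        for (Category c : Category.values()) {
            counts.put(c, 0);
        }
        for (int type : tokenTypes) {
            counts.merge(classify(type), 1, Integer::sum);
        }
        return counts;
    }
}
```

With these placeholder bounds, tally(new int[]{1, 50, 57, -1}) reports two reserved words, one punctuation token and one OTHER entry (the EOF token). In real use, the int array would be filled from t.getType() for each token returned by tokenizeMethod().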

For tasks that require the parse tree, like identifying variables, methods, annotations and more, use a Parser. The following code counts the local variable declarations inside a method. Note that it counts declaration statements, so a statement such as int a, b; that declares two variables is counted once:

/**
 * Returns the number of variable declarations inside the given method, by walking the method's parse tree
 * @param method The method text
 */
private int countVariableDeclarations(String method) {
    JavaLexer lex = new JavaLexer(new ANTLRInputStream(method));
    JavaParser parse = new JavaParser(new CommonTokenStream(lex));
    ParseTree tree = parse.methodDeclaration();

    ParseTreeWalker walker = new ParseTreeWalker();
    final AtomicInteger count = new AtomicInteger();
    walker.walk(new JavaBaseListener() {
        @Override public void enterLocalVariableDeclaration(JavaParser.LocalVariableDeclarationContext ctx) {
            count.incrementAndGet();
        }
    }, tree);

    return count.get();
}
