it's my first question here :)
I'd like to build a heterogeneous AST with ANTLR for a simple grammar. There are different interfaces to represent the AST nodes, e.g. IInfixExp and IVariableDecl. ANTLR comes with CommonTree to hold all the information about the source code (line number, character position, etc.), and I want to use it as the base class for the implementations of the AST interfaces (IInfixExp, ...).
In order to get an AST as output with CommonTree as node types, I set:
options {
    language = Java;
    k = 1;
    output = AST;
    ASTLabelType = CommonTree;
}
The IInfixExp interface is:
package toylanguage;

public interface IInfixExp extends IExpression {
    public enum Operator {
        PLUS, MINUS, TIMES, DIVIDE;
    }

    public Operator getOperator();

    public IExpression getLeftHandSide();

    public IExpression getRightHandSide();
}
and the implementation InfixExp is:
package toylanguage;

import org.antlr.runtime.Token;
import org.antlr.runtime.tree.CommonTree;

// IInitializable has only void initialize()
public class InfixExp extends CommonTree implements IInfixExp, IInitializable {
    private Operator operator;
    private IExpression leftHandSide;
    private IExpression rightHandSide;

    InfixExp(Token token) {
        super(token);
    }

    @Override
    public Operator getOperator() {
        return operator;
    }

    @Override
    public IExpression getLeftHandSide() {
        return leftHandSide;
    }

    @Override
    public IExpression getRightHandSide() {
        return rightHandSide;
    }

    // from IInitializable; gets called from ToyTreeAdaptor.rulePostProcessing
    @Override
    public void initialize() {
        // term ((PLUS|MINUS) term)+
        // atom ((TIMES|DIVIDE) atom)+
        // exactly 2 children
        assert getChildCount() == 2;
        // left and right child are IExpressions
        assert getChild(0) instanceof IExpression
                && getChild(1) instanceof IExpression;
        // operator
        switch (token.getType()) {
        case ToyLanguageParser.PLUS:
            operator = Operator.PLUS;
            break;
        case ToyLanguageParser.MINUS:
            operator = Operator.MINUS;
            break;
        case ToyLanguageParser.TIMES:
            operator = Operator.TIMES;
            break;
        case ToyLanguageParser.DIVIDE:
            operator = Operator.DIVIDE;
            break;
        default:
            assert false;
        }
        // left and right operands
        leftHandSide = (IExpression) getChild(0);
        rightHandSide = (IExpression) getChild(1);
    }
}
The corresponding rules are:
exp // e.g. a+b
    : term ((PLUS<InfixExp>^|MINUS<InfixExp>^) term)*
    ;
term // e.g. a*b
    : atom ((TIMES<InfixExp>^|DIVIDE<InfixExp>^) atom)*
    ;
This works fine, because PLUS, MINUS etc. are "real" tokens.
But now comes the imaginary token:
tokens {
    PROGRAM;
}
The corresponding rule is:
program // e.g. var a, b; a + b
    : varDecl* exp
      -> ^(PROGRAM<Program> varDecl* exp)
    ;
With this, ANTLR doesn't create a tree with PROGRAM as root node.
In the parser, the following code creates the Program instance:
root_1 = (CommonTree)adaptor.becomeRoot(new Program(PROGRAM), root_1);
Unlike for InfixExp, it is not the Program(Token) constructor that is invoked here, but Program(int).
Program is:
package toylanguage;

import java.util.Collections;
import java.util.LinkedList;
import java.util.List;

import org.antlr.runtime.Token;
import org.antlr.runtime.tree.CommonTree;

class Program extends CommonTree implements IProgram, IInitializable {
    private final LinkedList<IVariableDecl> variableDeclarations = new LinkedList<IVariableDecl>();
    private IExpression expression = null;

    Program(Token token) {
        super(token);
    }

    public Program(int tokenType) {
        // What to do?
        super();
    }

    @Override
    public List<IVariableDecl> getVariableDeclarations() {
        // don't allow to change the list
        return Collections.unmodifiableList(variableDeclarations);
    }

    @Override
    public IExpression getExpression() {
        return expression;
    }

    @Override
    public void initialize() {
        // program: varDecl* exp;
        // at least one child
        assert getChildCount() > 0;
        // the last one is a IExpression
        assert getChild(getChildCount() - 1) instanceof IExpression;
        // iterate over varDecl*
        int i = 0;
        while (getChild(i) instanceof IVariableDecl) {
            variableDeclarations.add((IVariableDecl) getChild(i));
            i++;
        }
        // exp
        expression = (IExpression) getChild(i);
    }
}
You can see the constructor:
public Program(int tokenType) {
    // What to do?
    super();
}
As a result, super() builds a CommonTree without a token, so CommonTreeAdaptor.rulePostProcessing sees a flat list, not a tree with a token as root.
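To make the effect described above concrete, here is a small self-contained sketch. It does not use the ANTLR runtime: SketchNode and NilRootSketch are made-up stand-ins, and they only assume two things about the real tree classes, namely that a node without a token counts as a "nil" list node and that adding a nil node as a child splices in its children instead of the node itself.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a tree node; this is NOT ANTLR's CommonTree.
class SketchNode {
    final String text; // null plays the role of "no token"
    final List<SketchNode> children = new ArrayList<>();

    SketchNode(String text) {
        this.text = text;
    }

    // Assumption: a node without a token is treated as a "nil" list node.
    boolean isNil() {
        return text == null;
    }

    // Assumption: adding a nil node splices in its children instead.
    void addChild(SketchNode child) {
        if (child.isNil()) {
            children.addAll(child.children);
        } else {
            children.add(child);
        }
    }

    String toStringTree() {
        if (children.isEmpty()) {
            return isNil() ? "nil" : text;
        }
        StringBuilder sb = new StringBuilder();
        if (!isNil()) {
            sb.append('(').append(text).append(' ');
        }
        for (int i = 0; i < children.size(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(children.get(i).toStringTree());
        }
        if (!isNil()) {
            sb.append(')');
        }
        return sb.toString();
    }
}

class NilRootSketch {
    // Builds "a b (+ a b)" under a root with the given text (null = no token)
    // and attaches that root to the rule's initial nil root.
    static String build(String rootText) {
        SketchNode plus = new SketchNode("+");
        plus.addChild(new SketchNode("a"));
        plus.addChild(new SketchNode("b"));

        SketchNode program = new SketchNode(rootText);
        program.addChild(new SketchNode("a"));
        program.addChild(new SketchNode("b"));
        program.addChild(plus);

        SketchNode ruleRoot = new SketchNode(null);
        ruleRoot.addChild(program);
        return ruleRoot.toStringTree();
    }

    public static void main(String[] args) {
        System.out.println(build(null));      // prints 'a b (+ a b)'
        System.out.println(build("PROGRAM")); // prints '(PROGRAM a b (+ a b))'
    }
}
```

Under these assumptions, the root built without a token dissolves into its parent, which matches the flat output I get from the real parser.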
My TreeAdaptor looks like:
package toylanguage;

import org.antlr.runtime.tree.CommonTreeAdaptor;

public class ToyTreeAdaptor extends CommonTreeAdaptor {
    @Override
    public Object rulePostProcessing(Object root) {
        Object result = super.rulePostProcessing(root);
        // check if needs initialising
        if (result instanceof IInitializable) {
            IInitializable initializable = (IInitializable) result;
            initializable.initialize();
        }
        return result;
    }
}
And to test everything I use:
package toylanguage;

import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import org.antlr.runtime.TokenStream;
import org.antlr.runtime.tree.CommonTree;

import toylanguage.ToyLanguageParser.program_return;

public class Processor {
    public static void main(String[] args) {
        String input = "var a, b; a + b + 123"; // sample input
        ANTLRStringStream stream = new ANTLRStringStream(input);
        ToyLanguageLexer lexer = new ToyLanguageLexer(stream);
        TokenStream tokens = new CommonTokenStream(lexer);
        ToyLanguageParser parser = new ToyLanguageParser(tokens);
        ToyTreeAdaptor treeAdaptor = new ToyTreeAdaptor();
        parser.setTreeAdaptor(treeAdaptor);
        try {
            // the outputs noted below are for the input: var a, b; a + b
            program_return program = parser.program();
            CommonTree root = program.tree;
            // prints 'a b (+ a b)'
            System.out.println(root.toStringTree());
            // get (+ a b), the third child of root
            CommonTree third = (CommonTree) root.getChild(2);
            // prints '(+ a b)'
            System.out.println(third.toStringTree());
            // prints 'true'
            System.out.println(third instanceof IInfixExp);
            // prints 'false'
            System.out.println(root instanceof IProgram);
        } catch (RecognitionException e) {
            e.printStackTrace();
        }
    }
}
For completeness, here is the full grammar:
grammar ToyLanguage;

options {
    language = Java;
    k = 1;
    output = AST;
    ASTLabelType = CommonTree;
}

tokens {
    PROGRAM;
}

@header {
    package toylanguage;
}

@lexer::header {
    package toylanguage;
}

program // e.g. var a, b; a + b
    : varDecl* exp
      -> ^(PROGRAM<Program> varDecl* exp)
    ;

varDecl // e.g. var a, b;
    : 'var'! ID<VariableDecl> (','! ID<VariableDecl>)* ';'!
    ;

exp // e.g. a+b
    : term ((PLUS<InfixExp>^|MINUS<InfixExp>^) term)*
    ;

term // e.g. a*b
    : atom ((TIMES<InfixExp>^|DIVIDE<InfixExp>^) atom)*
    ;

atom
    : INT<IntegerLiteralExp> // e.g. 123
    | ID<VariableExp> // e.g. a
    | '(' exp ')' -> exp // e.g. (a+b)
    ;

INT : ('0'..'9')+ ;
ID : ('a'..'z')+ ;
PLUS : '+' ;
MINUS : '-' ;
TIMES : '*' ;
DIVIDE : '/' ;
WS : ('\t' | '\n' | '\r' | ' ')+ { $channel = HIDDEN; } ;
OK, the final question is how to get from
program // e.g. var a, b; a + b
    : varDecl* exp
      -> ^(PROGRAM<Program> varDecl* exp)
    ;
a tree with PROGRAM as root
^(PROGRAM varDecl* exp)
and not a flat list with
(varDecl* exp) ?
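One direction that might work, sketched here with stand-in classes only and untested against the real generated parser, is to let the int constructor wrap the token type in a synthetic token instead of calling super() with no arguments. SketchToken, SketchTree, SketchProgram and the token type value 4 are hypothetical; with the real runtime the analogous token class would presumably be org.antlr.runtime.CommonToken, which has a CommonToken(int type, String text) constructor.

```java
// Stand-ins for illustration only; these are NOT the ANTLR runtime classes.
class SketchToken {
    final int type;
    final String text;

    SketchToken(int type, String text) {
        this.type = type;
        this.text = text;
    }
}

class SketchTree {
    final SketchToken token; // null would make this a "nil" node

    SketchTree(SketchToken token) {
        this.token = token;
    }

    boolean isNil() {
        return token == null;
    }
}

class SketchProgram extends SketchTree {
    static final int PROGRAM = 4; // hypothetical token type value

    SketchProgram(SketchToken token) {
        super(token);
    }

    // Wrap the bare token type in a synthetic token so the node is not nil.
    SketchProgram(int tokenType) {
        this(new SketchToken(tokenType, "PROGRAM"));
    }
}

class ProgramCtorSketch {
    public static void main(String[] args) {
        SketchProgram p = new SketchProgram(SketchProgram.PROGRAM);
        System.out.println(p.isNil());    // prints 'false'
        System.out.println(p.token.text); // prints 'PROGRAM'
    }
}
```

Whether the real tree adaptor then keeps the Program node as root is exactly what I would like to have confirmed.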
(Sorry for the numerous code fragments.)
Ciao Vertex