ANTLR parser and tree grammars for one simple language

Question

Edit:

Here is the updated tree and parser grammars:

parser grammar:

    options {

language = CSharp2;

output=AST;


}
tokens {
UNARY_MINUS;
CALL;
}
program :   (function)* main_function

        ;



function:       'function' IDENTIFIER '(' (parameter (',' parameter)*)? ')' 'returns' TYPE declaration* statement* 'end' 'function'
        ->    ^('function' IDENTIFIER parameter* TYPE declaration* statement*)
    ;

main_function
    :   'function' 'main' '(' ')' 'returns' TYPE declaration* statement*  'end' 'function'
    ->    ^('function' 'main' TYPE declaration* statement*)   
    ;   

parameter
    :   'param' IDENTIFIER ':' TYPE
    ->    ^('param' IDENTIFIER TYPE)
    ;

declaration
    :       'variable' IDENTIFIER ( ',' IDENTIFIER)* ':' TYPE ';'
    ->    ^('variable' TYPE IDENTIFIER+ )
    |       'array' array  ':' TYPE ';'
    ->    ^('array' array TYPE)
    ;

statement 
    : ';'! | block | assignment | if_statement | switch_statement | while_do_statement | for_statement | call_statement | return_statement  
    ;

call_statement
    :   call ';'!
    ;

return_statement
    :   'return' expression ';'
    ->    ^('return' expression)
    ;

block   : 'begin' declaration* statement* 'end'
        -> ^('begin' declaration* statement*)
        |  '{' declaration* statement* '}'
        -> ^('{' declaration* statement*)
    ;

assignment 
    :   IDENTIFIER ':=' expression ';'
        ->      ^(':=' IDENTIFIER expression )
    |       array ':=' expression ';'
    ->     ^(':=' array expression) 
    ;

array   :   IDENTIFIER '[' expression (',' expression)* ']'
    ->  ^(IDENTIFIER expression+)
    ;

if_statement 
    :   'if' '(' expression ')' 'then' statement ('else' statement)? 'end' 'if'
    ->    ^('if' expression statement statement?)

    ;

switch_statement 
    :   'switch' '(' expression ')' case_part+ ('default' ':' statement)? 'end' 'switch'
    ->    ^('switch' expression case_part+ statement?)
    ; 

case_part
    :   'case' literal (',' literal)* ':' statement
    ->    ^('case' literal+ statement)
    ;

literal 
    :   INTEGER | FLOAT | BOOLEAN | STRING
    ; 

while_do_statement
    :   'while' '(' expression ')' 'do' statement 'end' ' while'
    ->    ^('while' expression statement)
    ;

for_statement 
    :       'for' '(' IDENTIFIER ':=' expression 'to' expression ')' 'do' statement 'end' 'for'
    ->   ^('for' IDENTIFIER expression expression statement)
    ;

expression
    :   conjuction ( 'or'^ conjuction)*
    ;

conjuction
    :       equality ('and'^ equality)* 
    ;

equality:   relation (('=' | '/=')^ relation)?
        ;

relation:   addition (('<' | '<=' | '>' | '>=')^ addition)?
    ;

addition:   multiplication (('+' | '-')^ multiplication)*   
    ;

multiplication
    :   unary_operation (('*' | '/' | '%')^ unary_operation)*
    ;
unary_operation
    :   '-' primary 
    ->   ^(UNARY_MINUS primary)
    |        'not' primary 
    ->   ^('not' primary)
    |     primary
    ;

primary :   IDENTIFIER 
        | array 
        |  literal 
        | '('! expression ')'!  
        | '(' TYPE ')'  '(' expression ')'
        -> ^(TYPE expression) 
        |  call
    ; 

call    :   IDENTIFIER '(' arguments ')'
        ->     ^(CALL IDENTIFIER arguments)
    ;

arguments
    :   (expression  (','! expression)*)? 
    ;

BOOLEAN :   'true' | 'false'
    ;   

T    YPE    : 'integer' | 'boolean' | 'float' | 'string' | 'array' | 'void'
    ;

IDENTIFIER  :   ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
    ;

INTEGER :   '0'..'9'+
    ;

FLOAT
    :   ('0'..'9')+ '.' ('0'..'9')+ 
    ;

COMMENT
    :   '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
    |   '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
    ;

WS  :   ( ' '
        | '\t'
        | '\r'
        | '\n'
        ) {$channel=HIDDEN;}
    ;

STRING
    :  '"' .* '"'
    ;

And here is the updated tree grammar (I altered expressions, and so on...):

    options {
language = 'CSharp2';
//tokenVocab= token vocab needed
ASTLabelType=CommonTree; // what is Java type of nodes?

}
program :   (function)* main_function

        ;



function:     ^('function' IDENTIFIER parameter* TYPE declaration* statement*)
    ;

main_function
    :   ^('function' 'main' TYPE declaration* statement*)   
    ;   

parameter
    :   ^('param' IDENTIFIER TYPE)
    ;

declaration
    :     ^('variable' TYPE IDENTIFIER+)
        |     ^('array' array TYPE  )
    ;

statement 
    : block | assignment | if_statement | switch_statement | while_do_statement | for_statement | call_statement | return_statement 
    ;

call_statement
    :   call 
    ;

return_statement
    :   ^('return' expression)
    ;

block   : ^('begin' declaration* statement*)
        |  ^('{' declaration* statement*)
    ;

assignment 
    :   ^(':=' IDENTIFIER expression )
    |      ^(':=' array expression) 
    ;

array   :   ^(IDENTIFIER expression+)
    ;

if_statement 
    :   ^('if' expression statement statement?)

    ;

switch_statement 
    :   ^('switch' expression case_part+ statement?)
    ; 

case_part
    :   ^('case' literal+ statement)
    ;

literal 
    :   INTEGER | FLOAT | BOOLEAN | STRING
    ; 

while_do_statement
    :   ^('while' expression statement)
    ;

for_statement 
    :    ^('for' IDENTIFIER expression expression statement)
    ;

expression
    :   ^('or' expression expression)
    |      ^('and' expression expression)
    |      ^('=' expression expression)   
    |      ^('/=' expression expression)
    |       ^('<' expression expression)
    |       ^('<=' expression expression)
    |       ^('>' expression expression)
    |       ^('>=' expression expression)
    |       ^('+' expression expression)
    |       ^('-' expression expression)
    |      ^(UNARY_MINUS expression)
    |      ^('not' expression)
    |      IDENTIFIER
    |      array
    |       literal 
        |      ^(TYPE expression) 
        |      call
    ;

call    :   ^(CALL IDENTIFIER arguments)
    ;

arguments
    :   (expression  (expression)*)? 
    ;

I succesfluly generated tree graph with DOTTreeGenerator and StringTemplate classes so it seems that all is working at the moment. But any suggestions (about bad habits or something else in this grammars) are appreciated since I don't have a lot of experience with ANTLR or language recognition.

See updates on http://vladimir-radojicic.blogspot.com

Please provide the tree parser rule for your addition, too. Otherwise it's hard to tell what's going wrong. — Sebastian Zarnekow
That is impossible to tell without seeing the entire parser- and tree-grammar. — Bart Kiers
May I somehow send you whole grammar, since addition rule is calling another , and another and another rule, so it is whole grammar almost? — vldmrrdjcc
You can edit your original question and add both parser- and tree-grammars. — Bart Kiers
Here it is - parser and tree grammar I am using ANTLRWorks 1.4, and not 1.4.2 since at the moment it has some bugs when generating C# code — vldmrrdjcc

Bart Kiers Bart Kiers · Accepted Answer · 2011-05-01T06:01:35

The only thing I was going to suggest, besides introducing imaginary tokens to make sure your tree grammar produces a "unique AST" and simplifying the expression in the tree-grammar, which you both already did (again: well done!), is that you shouldn't use literal tokens inside your parser grammar. Especially not when they can possibly be matched by other lexer rule(s). For example, all your reserved words (like for, while, end, etc.) can also be matched by the lexer rule IDENTIFIER. It's better to create explicit tokens inside the lexer (and put these rules before the IDENTIFIER rule!):

...

FOR   : 'for'; 
WHILE : 'while'; 
END   : 'end';

...

IDENTIFIER  
  :  ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
  ;

...

Ideally, the tree grammar does not contain any quoted tokens. AFAIK, you can't import grammar X inside a grammar Y properly: the literal tokens inside grammar X are then not available in grammar Y. And when you split your your combined grammar in a parser- and lexer grammar, these literal tokens are not allowed. With small grammars like your, these last remarks are of no concern to you (and you could leave your grammar "as is"), but remember them when you create larger grammars.

Best of luck!

EDIT

Imaginary tokens are not only handy when there's no real token that can be made as root of the tree. The way I look at imaginary tokens is that they make your tree "unique", so that the tree grammar can only "walk" your tree in one possible way. Take subtraction and unary minus for example. If you wouldn't have created an imaginary token called UNARY_MINUS, but simply did this:

unary_operation
  :  '-' primary   -> ^('-' primary)
  |  'not' primary -> ^('not' primary)
  |  primary
  ;

then you'd have something like this in your tree grammar:

expression
  :  ^('-' expression expression)
  |  ...
  |  ^('-' expression)
  |  ...
  ;

Now both subtraction and unary minus start with the same tokens, which the tree grammar does not like! It's easy to see with this - (minus) example, but there can be quite some tricky cases (even with small grammars like yours!) that are not so obvious. So, always let the parser create "unique trees" while rewriting to AST's.

Hope that clarifies it (a bit).

ANTLR parser and tree grammars for one simple language

1 Answers

EDIT