0
votes

I am following this tutorial and successfully replicated its behavior except that I am using Antlr 4.7 instead of the 4.5 that the tutorial was using.

I am trying to build a DSL for expense tracker.

Was wondering if each element can have attributes?

E.g. this is what it looks like now

enter image description here

This is the code for the todo.g4 as seen in https://github.com/simkimsia/learn-antlr-web-js/blob/master/todo.g4

grammar todo;

elements
    : (element|emptyLine)* EOF
    ;

element
    : '*' ( ' ' | '\t' )* CONTENT NL+
    ;

emptyLine
    : NL
    ;

NL
    : '\r' | '\n' 
    ;

CONTENT
    : [a-zA-Z0-9_][a-zA-Z0-9_ \t]*
    ;    

Meaning to say the element will also have 2 attributes such as amount and payee. To keep it simple, I will have the same sentence structure so to allow parsing to be done more easily.

the format will be pay [payee] [amount]

the example is pay Acme Corp 123,789.45

so the payee is Acme Corp and the amount is 12378945 as expressed in integers to denote the amount in denominations of cents

another example is pay Banana Inc 700

so the payee is Banana Inc and the amount is 70000 as expressed in integers to denote the amount in denominations of cents

I am guessing I need to change the todo.g4 and then re generate the parser.

Can an element have other attributes? If so, how do I get started?

UPDATE

This is my latest attempts ranked with latest updates on top:

I just figured out how to use grun and testRig. Thanks @Raven for that tip.

latest attempt: My latest expense.g4 (only difference from earlier attempt is the regex for payment)

grammar expense;

payments: (payment NL)* ;  
payment: PAY receiver amount=NUMBER ;  
receiver: surname=ID (lastname=ID)? ;  

PAY: 'pay' ;
NUMBER: ([0-9]+(','[0-9]+)*)('.'[0-9]*)?;
ID: [a-zA-Z0-9_]+ ;
NL: '\n' | '\r\n' ;  
WS: [\t ]+ -> skip ;

enter image description here

Earlier attempt: This is my expense.g4

grammar expense;

payments: (payment NL)* ;  
payment: PAY receiver amount=NUMBER ;  
receiver: surname=ID (lastname=ID)? ;  

PAY: 'pay' ;
NUMBER: [0-9]+ (',' [0-9]+)+ ('.' [0-9]+)? ;  
ID: [a-zA-Z0-9_]+ ;
NL: '\n' | '\r\n' ;  
WS: [\t ]+ -> skip ;

enter image description here

enter image description here

Earlier attempt: https://github.com/simkimsia/learn-antlr-web-js/commit/728813ac275a3f2ad16d7f51ce15fcc27d40045b#commitcomment-25127606

Earlier attempt: https://github.com/simkimsia/learn-antlr-web-js/commit/0c32aec6ffb4b4275db86d54e9788058a2ce8759#commitcomment-25125695

3
I don't understand all the code, but have an eagle eye for typos. Line 56 : var tokens = new new antlr4.CommonTokenStream(expenseLexer); --> two times new, probable cause of error.BernardK
@BernardK Thanks. I have removed the extra new and used the payments function as suggested by Raven. I still see empty array when I tried to console.logKim Stacks

3 Answers

2
votes

Situation on October 24. 2017 at 19:00 UTC+1.

Your grammar works perfectly. I made a full test in Java.

File Expense.g4 :

grammar Expense;

payments
@init {System.out.println("Expense last update 1853");}
    : (payment NL)*
    ;

payment
    : PAY receiver amount=NUMBER
      {System.out.println("Payement found " + $amount.text + " to " + $receiver.text);}
    ;

receiver
    : surname=ID (lastname=ID)?
    ; 

PAY    : 'pay' ;
NUMBER : ([0-9]+(','[0-9]+)*)('.'[0-9]*)? ;
ID     : [a-zA-Z0-9_]+ ;
NL     : '\n' | '\r\n' ;  
WS     : [\t ]+ -> channel(HIDDEN) ; // keep the spaces (witout spaces ==> paydeltaco98)

File ExpenseMyListener.java :

public class ExpenseMyListener extends ExpenseBaseListener {
    ExpenseParser parser;
    public ExpenseMyListener(ExpenseParser parser) { this.parser = parser; }

    public void exitPayments(ExpenseParser.PaymentsContext ctx) {
        System.out.println(">>> in ExpenseMyListener for paymentsss");
        System.out.println(">>> there are " + ctx.payment().size() + " elements in the list of payments");
        for (int i = 0; i < ctx.payment().size(); i++) {
            System.out.println(ctx.payment(i).getText());
        }
    }

    public void exitPayment(ExpenseParser.PaymentContext ctx) {
        System.out.println(">>> in ExpenseMyListener for payment");
        System.out.println(parser.getTokenStream().getText(ctx));
    }
}

File test_expense.java :

import org.antlr.v4.runtime.ANTLRFileStream;
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.ParserRuleContext;
import org.antlr.v4.runtime.tree.*;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.IOException;

public class test_expense {
    public static void main(String[] args) throws IOException {
        ANTLRInputStream input = new ANTLRFileStream(args[0]);
        ExpenseLexer lexer = new ExpenseLexer(input);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        ExpenseParser parser = new ExpenseParser(tokens);
        ParseTree tree = parser.payments();
        System.out.println("---parsing ended");
        ParseTreeWalker walker = new ParseTreeWalker();
        ExpenseMyListener my_listener = new ExpenseMyListener(parser);
        System.out.println(">>>> about to walk");
        walker.walk(my_listener, tree);
    }
}

Input file top.text :

pay Acme Corp 123,456
pay Banana Inc 456789.00
pay charlie pte 123,456.89
pay delta co 98

Execution :

$ export CLASSPATH=".:/usr/local/lib/antlr-4.6-complete.jar"
$ alias
alias a4='java -jar /usr/local/lib/antlr-4.6-complete.jar'
alias grun='java org.antlr.v4.gui.TestRig'
$ a4 Expense.g4 
$ javac Ex*.java
$ javac test_expense.java 
$ grun Expense payments -tokens -diagnostics top.text
[@0,0:2='pay',<'pay'>,1:0]
[@1,3:3=' ',<WS>,channel=1,1:3]
[@2,4:7='Acme',<ID>,1:4]
[@3,8:8=' ',<WS>,channel=1,1:8]
[@4,9:12='Corp',<ID>,1:9]
...
[@32,90:89='<EOF>',<EOF>,5:0]
Expense last update 1853
Payement found 123,456 to Acme Corp
Payement found 456789.00 to Banana Inc
Payement found 123,456.89 to charlie pte
Payement found 98 to delta co

$ java test_expense top.text 
Expense last update 1853
Payement found 123,456 to Acme Corp
Payement found 456789.00 to Banana Inc
Payement found 123,456.89 to charlie pte
Payement found 98 to delta co
---parsing ended
>>>> about to walk
>>> in ExpenseMyListener for payment
pay Acme Corp 123,456
>>> in ExpenseMyListener for payment
pay Banana Inc 456789.00
>>> in ExpenseMyListener for payment
pay charlie pte 123,456.89
>>> in ExpenseMyListener for payment
pay delta co 98
>>> in ExpenseMyListener for paymentsss
>>> there are 4 elements in the list of payments
payAcmeCorp123,456
payBananaInc456789.00
paycharliepte123,456.89
paydeltaco98
1
votes

I'm not entirely sure what exactly you want but for the provided examples this grammar should do the job:

payments: (payment NL)* ;  
payment: PAY receiver amount=NUMBER ;  
receiver: surname=ID (lastname=ID)? ;  

PAY: 'pay' ;
NUMBER: [0-9]+ (',' [0-9]+)+ ('.' [0-9]+)? ;  
ID: [a-zA-Z0-9_]+ ;
NL: '\n' | '\r\n' ;  
WS: [\t ]+ -> skip ;

If this is what you were asking for I will add some more explanation if needed...

1
votes

I am guessing I need to change the todo.g4 and then re generate the parser.

Of course regenerate after each change. For me it's :

$ a4 Question.g4
$ javac Q*.java
$ grun Question elements -tokens -diagnostics t.text

where

$ alias
alias a4='java -jar /usr/local/lib/antlr-4.6-complete.jar'
alias grun='java org.antlr.v4.gui.TestRig'

The more you describe specific contents, the more you may face ambiguity problems. For example, you have two rules :

payment   : 'pay' [payee] [amount]
free_text : ... any character ...

Consider the following content :

* pay Federico Tomassetti 10 € for the tutorial

* pay Federico Tomassetti 10 is ambiguous and can be matched by the two rules, but it will finally be parsed as free text, because of € for the tutorial which doesn't satisfy payment.

If later you change the payment rule to accept more info after the amount :

payment   : 'pay' [payee] [amount] payment_info

the above content will be matched by payment (in case of ambiguity ANTLR chooses the first rule). The good news is that ANTLR 4 is very strong to disambiguate, it reads the whole file if necessary.

For ambiguous tokens and precedence rules, read the posts of these last three weeks, a lot have been said.

Mixing Raven's grammar with yours, this is one possible solution :

File Question.g4

grammar Question;

elements
@init {System.out.println("Question last update 1432");}
    : ( element | emptyLine )* EOF
    ;

element
    : '*' content NL
    ;

content
    : payment   //{System.out.println("Payement found " + $payment.text);}
    | free_text {System.out.println("Free text found " + $free_text.text);}
    ;

payment
    : PAY receiver amount=NUMBER
      {System.out.println("Payement found " + $amount.text + " to " + $receiver.text);}
    ;

receiver
    : surname=WORD ( lastname=WORD )?
    ;  

free_text
    : ( WORD | PAY | NUMBER )+
    ;

emptyLine
    : NL
    ;

PAY    : 'pay' ;
WORD   : LETTER ( LETTER | DIGIT | '_' )* ;
NUMBER : DIGIT+ ( ',' DIGIT+ )? ( '.' DIGIT+ )? ;  

NL  : [\r\n]
    | '\r\n' 
    ;
//WS  : [ \t]+ -> skip ; // $payment.text => payAcmeCorp123,789.45
WS  : [ \t]+ -> channel(HIDDEN) ; // spaces are needed to nicely display $payment.text

fragment DIGIT  : [0-9] ;
fragment LETTER : [a-zA-Z] ;

File t.text

* play with ANTLR 4
* write a tutorial
* pay Acme Corp 123,789.45
* pay Banana Inc 700
* pay Federico Tomassetti 10 € for the tutorial

Execution :

$ grun Question elements -tokens -diagnostics t.text
line 5:29 token recognition error at: '€'
[@0,0:0='*',<'*'>,1:0]
[@1,1:1=' ',<WS>,channel=1,1:1]
[@2,2:5='play',<WORD>,1:2]
[@3,6:6=' ',<WS>,channel=1,1:6]
[@4,7:10='with',<WORD>,1:7]
[@5,11:11=' ',<WS>,channel=1,1:11]
[@6,12:16='ANTLR',<WORD>,1:12]
[@7,17:17=' ',<WS>,channel=1,1:17]
[@8,18:18='4',<NUMBER>,1:18]
[@9,19:19='\n',<NL>,1:19]
[@10,20:20='*',<'*'>,2:0]
[@11,21:21=' ',<WS>,channel=1,2:1]
[@12,22:26='write',<WORD>,2:2]
[@13,27:27=' ',<WS>,channel=1,2:7]
[@14,28:28='a',<WORD>,2:8]
[@15,29:29=' ',<WS>,channel=1,2:9]
[@16,30:37='tutorial',<WORD>,2:10]
[@17,38:38='\n',<NL>,2:18]
...
[@56,136:135='<EOF>',<EOF>,7:0]
Question last update 1432
Free text found play with ANTLR 4
Free text found write a tutorial
line 3:26 reportAttemptingFullContext d=2 (content), input='pay Acme Corp 123,789.45
'
...
Payement found 700 to Banana Inc
Free text found pay Federico Tomassetti 10  for the tutorial

As you can see, the € symbol is not recognized. You may need a CONTENT rule similar to FIELDTEXT here, and then you get into trouble ...

Federico's Mega tutorial is a good start. For nitty-gritty details, see The Definitive ANTLR 4 Reference or the online doc from www.antlr.org.