Grammar for ANTLR 4

Question

I'm trying to develop a grammar to parse a DSL using ANTLR4 (my first attempt at using it). The language itself is somewhat similar to SQL.

It should be able to parse commands like the following:

select type1.attribute1 type2./xpath_expression[@id='test 1'] type3.* from source1 source2 
fromDate 2014-01-12T00:00:00.123456+00:00 toDate 2014-01-13T00:00:00.123456Z
where (type1.attribute2 = "XX" AND 
    (type1.attribute3 <= "2014-01-12T00:00:00.123456+00:00" OR
    type2./another_xpath_expression = "YY"))

EDIT: I've updated the grammar, switching CHAR, SYMBOL and DIGIT to fragments as suggested by [lucas_trzesniewski], but it did not improve things. Attached is the parse tree, as suggested by Terence. I also get the following in the console (which leaves me more confused):

warning(125): API.g4:16:8: implicit definition of token 'CHAR' in parser
warning(125): API.g4:20:31: implicit definition of token 'SYMBOL' in parser
line 1:12 mismatched input 'p' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:19 mismatched input 't' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:27 mismatched input 'm' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:35 mismatched input '@' expecting {NUMBER, CHAR, SYMBOL}
line 1:58 no viable alternative at input 'm'
line 3:13 no viable alternative at input '(deco.m'

I was able to put together the bulk of the grammar, but it fails to match all the tokens properly, so the parse comes out wrong depending on the complexity of the input. From what I have read online, the main reason seems to be that the lexer selects the longest matching sequence, but even after several attempts at rewriting the lexer and parser rules I could not arrive at a robust set.

Below are my grammar and some test cases. What would be the correct way to specify the rules? Should I use lexer modes?

GRAMMAR

grammar API;

get : K_SELECT  (((element) )+ | '*') 
      'from'  (source )+
      ( K_FROM_DATE dateTimeOffset )? ( K_TO_DATE dateTimeOffset )?
      ('where'  expr )?  
      EOF
    ;


element     : qualifier DOT attribute; 
qualifier   : 'raw' | 'std' | 'deco' ;
attribute   : ( word | xpath | '*') ;

word  : CHAR (CHAR | NUMBER)*;

xpath   : (xpathFragment+);
xpathFragment
    : '/' ( DOT | CHAR | NUMBER | SYMBOL )+ 
    | '[' (CHAR | NUMBER | SYMBOL )+ ']'
    ;

source      : ( 'system1' | 'system2' | 'ALL')  ; // should be generalised.


date        : (NUMBER MINUS NUMBER MINUS NUMBER) ;
time        : (NUMBER COLON NUMBER (COLON NUMBER ( DOT NUMBER )?)? ( 'Z' | SIGN (NUMBER COLON NUMBER )));
dateTimeOffset : date 'T' time;

filter      : (element OP value) ;
value       : QUOTE .+? QUOTE ;

expr
    :  filter 
    | '(' expr 'AND' expr ')'
    | '(' expr 'OR'  expr ')'
    ;


K_SELECT    : 'select';
K_RANGE     : 'range';
K_FROM_DATE : 'fromDate';
K_TO_DATE   : 'toDate'  ;


QUOTE : '"' ;
MINUS : '-';
SIGN  : '+' | '-';
COLON : ':';
COMMA : ',';
DOT   : '.';
OP    : '=' | '<' | '<=' | '>' | '>=' | '!=';


NUMBER : DIGIT+;

fragment DIGIT : ('0'..'9');
fragment CHAR   : [a-z] | [A-Z] ;
fragment SYMBOL : '@' | [-_=] | '\'' | '/' | '\\' ;

WS    : [ \t\r\n]+ -> skip ;
NONWS : ~[ \t\r\n];

TEST 1

select raw./priobj/tradeid/margin[@id='222'] deco.* deco.marginType from system1 system2
fromDate 2014-01-12T00:00:00.123456+00:00 toDate 2014-01-13T00:00:00.123456Z 
where ( deco.marginType >= "MV" AND ( ( raw.CretSysInst = "RMS_EXODUS" OR deco.ExtSysNum <= "1234" ) OR deco.ExtSysStr = "TEST Spaced" ) )

TEST 2

select * from ALL

TEST 3

select deco./xpath/expr/text() deco./xpath/expr[a='3' and b gt '6] raw.* from ALL where raw.attr3 = "myvalue"

The image shows that my grammar is unable to recognise several parts of the commands: [image: parse tree]

What puzzles me a bit is that the individual parts work properly on their own, e.g. parsing only the 'expr' rule, as shown by the tree below: [image: expr tree]

I suggest looking at the parse tree to figure out how the grammar is recognizing your input. See grun tool/script with -gui option — Terence Parr
You did not replace word: (CHAR (CHAR | NUMBER)+); with WORD: CHAR (CHAR | NUMBER)*;, so each letter becomes a token. The parser won't be able to cope until you get the lexer grammar right. PS: Next time you edit your post, please leave a comment or it could go unnoticed. — Lucas Trzesniewski
Thanks @LucasTrzesniewski, I have now replaced almost all the rules with tokens and reorganised the grammar, but I'm still facing issues. I'll probably post the simplified grammar in another question. — Daniele
Thanks @TheANTLRGuy, I was indeed using the grun tool to debug my grammar against a set of tests. I can see the tree but I can't figure out which rule/token Antlr is applying to different parts of the inputs. Is there any way to show this information? (I'm pretty sure this is my fault and there is indeed a way, just was unable to find it; I will update this comment also with a link to another question where I'll explain this in more detail) — Daniele
Try out the profiler: groups.google.com/forum/#!topic/antlr-discussion/5OI0FGKEk68 — Terence Parr

Lucas Trzesniewski · Accepted Answer · 2014-06-30T18:48:39

That kind of thing: word : (CHAR (CHAR | NUMBER)+); is indeed a job for the lexer, not the parser.

This: DIGIT : ('0'..'9'); should be a fragment. The same goes for this: CHAR : [a-z] | [A-Z] ;. That way, you could write NUMBER : DIGIT+; and WORD: CHAR (CHAR | NUMBER)*;

The reason is simple: you want to deal with meaningful tokens in your parser, not with parts of words. Think of the lexer as the thing that will "cut" the input text at meaningful points. Later on, you want to process full words, not individual characters. So think about where it is most meaningful to make those cuts.
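
As a rough illustration of that split, a lexer section along the following lines would hand the parser whole words and numbers. This is only a sketch under assumptions: the grammar name ApiSketch, the query rule, the STAR token and the simplified element/attribute rules are made up for the example and are not the asker's final grammar. The xpath and date parts are left out on purpose.

```antlr
// Sketch only: ApiSketch, query and STAR are hypothetical names.
grammar ApiSketch;

// Parser rules see whole tokens, never single characters.
query     : K_SELECT element+ EOF ;
element   : WORD DOT attribute ;
attribute : WORD | STAR ;

// Keywords come before WORD: when two lexer rules match the same
// text with the same length, the rule defined first wins.
K_SELECT : 'select' ;

DOT    : '.' ;
STAR   : '*' ;
NUMBER : DIGIT+ ;
WORD   : CHAR (CHAR | DIGIT)* ;
WS     : [ \t\r\n]+ -> skip ;

// Fragments are building blocks for other lexer rules;
// they never reach the parser as tokens of their own.
fragment DIGIT : [0-9] ;
fragment CHAR  : [a-zA-Z] ;
```

With rules like these, an input such as deco.marginType should reach the parser as three tokens (WORD, DOT, WORD) rather than as one token per letter.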

Now, as the ANTLR master has pointed out, to debug your problem, dump the parse tree and see what goes on.
