Antlr4 parsing inconsistency

Question

In a little test parser I just wrote, I encountered a weird problem that I don't quite understand.

Stripping it down to the smallest example showing the problem, let's start with the following grammar:

Testing.g4:

grammar Testing;

cscript                           // This is the construct I shortened
    : (statement_list)* ;

statement_list
    : statement ';' statement_list?
    | block
    ;

statement
    : assignment_statement
    ;

block : '{' statement_list? '}' ;

expression
    : left=expression op=('*'|'/') right=expression              # arithmeticExpression
    | left=expression op=('+'|'-') right=expression              # arithmeticExpression
    | left=expression op=Comparison_operator right=expression    # comparisonExpression
    | ID                                                         # variableValueExpression
    | constant                                                   # ignore  // will be executed with the rule name
    ;

assignment_statement
    : ID op=Assignment_operator expression
    ;

constant
    : INT
    | REAL;

Assignment_operator : ('=' | '+=' | '-=') ;

Comparison_operator : ('<' | '>' | '==' | '!=') ;

Comment : '//' .*? '\n' -> skip;

fragment NUM : [0-9];

INT : NUM+;
REAL
    : NUM* '.' NUM+
    | '.' NUM+
    | INT
    ; 

ID : [a-zA-Z_] [a-zA-Z_0-9]*;

WS : [ \t\r\n]+ -> skip;

Using the input

z = x + y;

everything is fine: we get a parse tree that goes from cscript to statement_list, statement, assignment_statement, ID and expression. Great!

Now, if I add the possibility to declare variables, all goes down the drain:

This is the change to the grammar:

cscript
    : (statement_list | variable_declaration ';')* ;

variable_declaration
    : type ID ('=' expression)?
    ;

type
    : 'int'
    | 'real'
    ;

statement_list
    : statement ';' statement_list?
    | block
    ;

statement
    : assignment_statement
    ;

// (continue as before)

All of a sudden, the same test input is wrongly dissected into two statement_lists. Each continues to a statement with a "missing ';'" warning; the first goes to an incomplete assignment_statement "z =" and the second to an incomplete assignment_statement "x +".

My attempt to show the parse tree in text-form:

cscript
    statement_list
        statement
            assignment_statement
                'z'
                '=' [marked as error]
        [warning: missing ';']
    statement_list
        statement
            assignment_statement
                'x'
                '+' [marked as error]
        'y' [marked as error]
        ';'

Can anyone tell me what the problem is? (And how to fix it? ;-))

Edit on 2016-12-26, after Mike's comment:

After replacing all implicit lexer rules with explicit declarations, all of a sudden, the input "z = x + y" worked. (thumbs up)

The next thing I did was restoring more of the original example I had in mind, and adding a new input line

int x = 22;

to the input (which worked previously, but did not make it into the minimal example). Now that line fails. This is the -tokens output of the test rig:

[@0,0:2='int',<4>,1:0]
[@1,4:4='x',<22>,1:4]
[@2,6:6='=',<1>,1:6]
[@3,8:9='22',<20>,1:8]
[@4,10:10=';',<12>,1:10]
[@5,13:13='z',<22>,2:0]
[@6,15:15='=',<1>,2:2]
[@7,17:17='x',<22>,2:4]
[@8,19:19='+',<18>,2:6]
[@9,21:21='y',<22>,2:8]
[@10,22:22=';',<12>,2:9]
[@11,25:24='<EOF>',<-1>,3:0]
line 1:6 mismatched input '=' expecting '='

As the problem seemed to be in the variable_declaration part, I even tried to split this into two parsing rules like this:

cscript
    : (statement_list | variable_declaration_and_assignment SEMICOLON | variable_declaration SEMICOLON)* ;

variable_declaration_and_assignment
    : type ID EQUAL expression
    ;

variable_declaration
    : type ID
    ;

With the result:

line 1:6 no viable alternative at input 'intx='

Still stuck :-( BTW: Splitting the "int x = 22;" into "int x;" and "x = 22;" works. sigh

Edit on 2016-12-26, after Mike's next comment:

Double-checked, and everything is lexer rules. Still, the mismatch between '=' and '=' (which I unfortunately cannot reconstruct anymore) gave me the idea to check the token types. The current status is:

(Shortened grammar)

cscript
    : (statement_list | variable_declaration)* ;

...

variable_declaration
    : type ID (EQUAL expression)? SEMICOLON
    ;

...

Assignment_operator : (EQUAL | PLUS_EQ | MINUS_EQ) ;

// among others
PLUS_EQ : '+=';
MINUS_EQ : '-=';
EQUAL: '=';

...

Shortened output:

[@0,0:2='int',<4>,1:0]
[@1,4:4='x',<22>,1:4]
[@2,6:6='=',<1>,1:6]
...
line 1:6 mismatched input '=' expecting ';'

Here, if I understand this correctly, the '=' is lexed as token type 1, which, according to the lexer.tokens output, is Assignment_operator, while the expected EQUAL would be 13.

Might this be the problem?
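
For readers who want to make the same check, here is a minimal Python sketch of mapping token type numbers back to names. It assumes the generated .tokens file consists of lines of the form NAME=number or 'literal'=number. The sample text is a hypothetical reconstruction from the numbers quoted above (1, 12, 13), not the actual generated file, and `load_token_names` is an illustrative helper, not part of ANTLR.

```python
def load_token_names(tokens_text):
    """Map each token type number to the names/literals that share it."""
    names = {}
    for line in tokens_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the LAST '=' so that a literal such as '=' stays intact.
        name, _, number = line.rpartition("=")
        names.setdefault(int(number), []).append(name)
    return names


# Hypothetical sample, reconstructed from the type numbers in the question.
sample = """\
Assignment_operator=1
SEMICOLON=12
EQUAL=13
';'=12
'='=13
"""

names = load_token_names(sample)
print(names[1])   # what the lexer actually produced for '='
print(names[13])  # what variable_declaration was waiting for
```

Looking up the type shown in angle brackets in the token dump (here <1>) tells you which rule the lexer really chose for a given piece of input.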

Strange problem. Start by defining all lexical elements as lexer rules. No implicit literals in parser rules. Then let your buffered token stream give you all the tokens it finds (stream.fill(), then iterate over stream.getTokens() with a token.toString() call). What tokens do you see in this list? — Mike Lischke
@MikeLischke Thanks for your input! I edited the question to include the token results and the next resulting problem. — mtj
This error: line 1:6 mismatched input '=' expecting '=' is usually a sign that there are multiple '=' definitions (e.g. literals inlined in parser rules). Double check your token declarations. Are all token literals covered by lexer rules instead of being lazily used directly in a parser rule? — Mike Lischke
@MikeLischke Thanks for your continued help, I have the feeling we might be getting there :-) Please see the second edit. — mtj
Hmm, you can use the vocabulary to see which token type represents which literal. However, in the token dump you already see the value from the vocabulary (which seems to be correct). Why do you think 13 is what's expected? Do you still have more than one '=' token? Also, do you, by any chance, have fragment lexer rules which you use in parser rules (which doesn't work)? — Mike Lischke

Mike Lischke · Accepted Answer · 2016-12-27T09:42:28

OK, it seems the main takeaway here is: think about your token definitions and how you define them. Create explicit lexer rules for your literals instead of defining them implicitly in parser rules. If the parser gives you weird errors, check the token types you get from the lexer first: they must be correct, or the parser has no chance of doing its job.
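
To illustrate why the same '=' can come out with an unexpected token type, here is a minimal Python sketch. It is a simulation, not ANTLR itself. It mimics the ANTLR 4 lexer policy that the longest match wins and that, on a tie, the rule listed first in the grammar wins. The rule names mirror the question's second edit; `lex_one` and the `reordered` list are hypothetical and only serve the illustration.

```python
import re


def lex_one(rules, text):
    """Return the name of the rule that wins at the start of `text`.

    Longest match wins; on a tie the rule that comes first in `rules` wins.
    """
    best_name, best_len = None, 0
    for name, pattern in rules:
        m = re.match(pattern, text)
        # Strictly greater: an equally long later match never replaces an earlier one.
        if m and len(m.group()) > best_len:
            best_name, best_len = name, len(m.group())
    return best_name


# Order as in the question's second edit: Assignment_operator precedes EQUAL.
question_order = [
    ("Assignment_operator", r"=|\+=|-="),
    ("PLUS_EQ", r"\+="),
    ("MINUS_EQ", r"-="),
    ("EQUAL", r"="),
]

# Hypothetical reordering: the specific tokens come first.
reordered = [
    ("PLUS_EQ", r"\+="),
    ("MINUS_EQ", r"-="),
    ("EQUAL", r"="),
    ("Assignment_operator", r"=|\+=|-="),
]

print(lex_one(question_order, "= 22;"))
print(lex_one(reordered, "= 22;"))
```

With the question's ordering, the sketch never yields EQUAL for '=', which matches the token dump (type 1 instead of 13) and would explain why a parser rule waiting for EQUAL fails. One possible way out, offered here as an assumption rather than something stated in this thread, is to keep only EQUAL, PLUS_EQ and MINUS_EQ as lexer rules and turn the operator group into a parser rule, so that each piece of input has exactly one token type.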
