in a little test-parser I just wrote, I encountered a weird problem, which I don't quite understand.
Stripping it down to the smallest example showing the problem, let's start with the following grammar:
Testing.g4:
grammar Testing;
cscript // This is the construct I shortened
: (statement_list)* ;
statement_list
: statement ';' statement_list?
| block
;
statement
: assignment_statement
;
block : '{' statement_list? '}' ;
expression
: left=expression op=('*'|'/') right=expression # arithmeticExpression
| left=expression op=('+'|'-') right=expression # arithmeticExpression
| left=expression op=Comparison_operator right=expression # comparisonExpression
| ID # variableValueExpression
| constant # ignore // will be executed with the rule name
;
assignment_statement
: ID op=Assignment_operator expression
;
constant
: INT
| REAL;
Assignment_operator : ('=' | '+=' | '-=') ;
Comparison_operator : ('<' | '>' | '==' | '!=') ;
Comment : '//' .*? '\n' -> skip;
fragment NUM : [0-9];
INT : NUM+;
REAL
: NUM* '.' NUM+
| '.' NUM+
| INT
;
ID : [a-zA-Z_] [a-zA-Z_0-9]*;
WS : [ \t\r\n]+ -> skip;
Using the input
z = x + y;
everything is fine, we get a parse tree which goes from cscript to statement_list, statement, assignment_statement, id and expression. Great!
Now, if I add the possibility to declare variables, all goes down the drain:
This is the change to the grammar:
cscript
: (statement_list | variable_declaration ';')* ;
variable_declaration
: type ID ('=' expression)?
;
type
: 'int'
| 'real'
;
statement_list
: statement ';' statement_list?
| block
;
statement
: assignment_statement
;
// (continue as before)
All of a sudden, the same test-input gets wrongly dissected into two statement_lists, each continued to a statement with a "missing ';'" warning, the first going to an incomplete assignment_statement of "z =" and the second to an incomplete assignment_statement "x +".
My attempt to show the parse tree in text-form:
cscript
statement_list
statement
assignment_statement
'z'
'=' [marked as error]
[warning: missing ';']
statement_list
statement
assignment_statement
'x'
'+' [marked as error]
'y' [marked as error]
';'
Can anyone tell me what the problem is? (And how to fix it? ;-))
Edit on 2016-12-26, after Mike's comment:
After replacing all implicit lexer rules with explicit declarations, all of a sudden, the input "z = x + y" worked. (thumbs up)
The next thing I did was restoring more of the original example I had in mind, and adding a new input line
int x = 22;
to the input (which worked previously, but did not make it into the minimal example). Now, that line fails. This is the -token output of the test rig:
[@0,0:2='int',<4>,1:0]
[@1,4:4='x',<22>,1:4]
[@2,6:6='=',<1>,1:6]
[@3,8:9='22',<20>,1:8]
[@4,10:10=';',<12>,1:10]
[@5,13:13='z',<22>,2:0]
[@6,15:15='=',<1>,2:2]
[@7,17:17='x',<22>,2:4]
[@8,19:19='+',<18>,2:6]
[@9,21:21='y',<22>,2:8]
[@10,22:22=';',<12>,2:9]
[@11,25:24='<EOF>',<-1>,3:0]
line 1:6 mismatched input '=' expecting '='
As the problem seemed to be in the variable_declaration part, I even tried to split this into two parsing rules like this:
cscript
: (statement_list | variable_declaration_and_assignment SEMICOLON | variable_declaration SEMICOLON)* ;
variable_declaration_and_assignment
: type ID EQUAL expression
;
variable_declaration
: type ID
;
With the result:
line 1:6 no viable alternative at input 'intx='
Still stuck :-( BTW: Splitting the "int x = 22;" into "int x;" and "x = 22;" works. sigh
Edit on 2016-12-26, after Mike's next comment:
Double-checked, and everything is lexer rules. Still, the mismatch between '=' and '=' (which I unfortunately cannot reconstruct anymore) gave me the idea to check the token types. The current status is:
(Shortened grammar)
cscript
: (statement_list | variable_declaration)* ;
...
variable_declaration
: type ID (EQUAL expression)? SEMICOLON
;
...
Assignment_operator : (EQUAL | PLUS_EQ | MINUS_EQ) ;
// among others
PLUS_EQ : '+=';
MINUS_EQ : '-=';
EQUAL: '=';
...
Shortened output:
[@0,0:2='int',<4>,1:0]
[@1,4:4='x',<22>,1:4]
[@2,6:6='=',<1>,1:6]
...
line 1:6 mismatched input '=' expecting ';'
Here, if I understand this correctly, the '=' is parsed to token type 1, which - according to the lexer.tokens output - is Assignment_Operator, while the expected EQUAL would be 13.
Might this be the problem?
stream.fill()then iterate overstream.getTokens();with a token.toString() call). What tokens do you see in this list? - Mike Lischkeline 1:6 mismatched input '=' expecting '='is usually a sign that there are multiple '=' definitions (e.g. literals inlined in parser rules). Double check your token declarations. Are all token literals covered by lexer rules instead of being lazily used directly in a parser rule? - Mike Lischke