Parsing Decaf grammar in Antlr4

Question

I am writing parser and lexer rules in ANTLR4 for the Decaf programming language. I run a parser test file, print the visited nodes to the terminal, and paste them into D3_parser_tree.html to view the parse tree. For the test input class program { int i [10]; } the resulting parse tree is missing the number 10 and the right square bracket.

The error I am getting : mismatched input '10' expecting INT_LITERAL

I am not sure why I get this error, since I have declared a lexer rule for INT_LITERAL and referenced it in a parser rule used by field_decl, following the given Decaf spec:

** Parser rules **

<program> → class Program ‘{‘ <field_decl>* <method_decl>* ‘}’
<field_decl> → <type> { <id> | <id> ‘[‘ <int_literal> ‘]’ }+, ;
<method_decl> → { <type> | void } <id> ( [ { <type> <id> }+, ] ) <block>
<digit> → 0 | 1 | 2 | … | 9
<block> → ‘{‘ <var_decl>* <statement>* ‘}’
<literal> → <int_literal> | <char_literal> | <bool_literal>
<hex_digit> → <digit> | a | b | c | … | f | A | B | C | … | F
<int_literal> → <decimal_literal> | <hex_literal>
<decimal_literal> → <digit> <digit>*
<hex_literal> → 0x <hex_digit> <hex_digit>*

Related Lexer rules :

NUMBER : [0-9]+;
fragment ALPHA : [_a-zA-Z0-9];
fragment DIGIT : [0-9];
fragment DECIMAL_LITERAL : DIGIT+;
CHAR_LITERAL : '\'' CHAR '\'';
STRING_LITERAL : '"' CHAR+ '"' ;
COMMENT : '//' ~('\n')* '\n' -> skip;
WS : (' ' | '\n' | '\t' | '\r') + -> skip;

Related Parser rules :

program : CLASS VAR LCURLYBRACE field_decl* method_decl* RCURLYBRACE EOF;
field_decl : data_type field ( COMMA field )* SEMICOLON;

Please let me know if you need further details. I appreciate your help.

You'll need to post enough of your grammar file to reproduce the error. The most likely cause of your problem is that your INT_LITERAL rule conflicts with another rule that you did not show us. It would also be helpful if you printed out which tokens are being generated for your input, specifically what kind of token '10' is recognized as. — sepp2k
Yeah, what sepp2k said. Easiest is to edit your question and add your entire ANTLR grammar to it. — Bart Kiers
The generated tokens are: CLASS, VAR, LCURLYBRACE, INT, VAR, LSQUAREBRACE, NUMBER, RSQUAREBRACE, SEMICOLON, RCURLYBRACE. The lexer rules seem fine here, and the number 10 is recognised as NUMBER. I will add some grammar rules. — Khaled Salem
@KhaledSalem I don't see a definition of NUMBER in your edited rules, but clearly that's where the problem lies. From the name I'm guessing that NUMBER and INT_LITERAL overlap heavily. — sepp2k

Bart Kiers · Accepted Answer · 2020-04-12T20:25:46

The following rules conflict:

VAR : ALPHA+;
...
NUMBER : [0-9]+;
...
INT_LITERAL : DECIMAL_LITERAL | HEX_LITERAL;

They all match 10, but the lexer will always choose VAR since that is the rule defined first.

This is just how ANTLR's lexer works: it tries to match as many characters as possible, and when two (or more) rules match the same number of characters, the one defined first "wins".

You will see that it parses correctly if you change field into:

field : VAR | VAR LSQUAREBRACE VAR RSQUAREBRACE;
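
The tie-breaking behaviour described above can be illustrated with a small Python simulation. This is only a sketch and not ANTLR itself: first_token is a hypothetical helper, the regular expressions are approximations of the lexer rules (the hex pattern is assumed from the spec's <hex_literal> production), and the "disjoint" rule set is one possible rewrite, not something taken from the answer.

```python
import re


def first_token(rules, text):
    """Return (rule_name, lexeme) for the token at the start of text.

    Mimics the two steps described above: prefer the longest match,
    and on a tie prefer the rule that is listed first.
    """
    best_name, best_lexeme = None, ""
    for name, pattern in rules:
        m = re.match(pattern, text)
        # Only a strictly longer match replaces the current best, so an
        # equal-length match from a later rule never beats an earlier one.
        if m and len(m.group(0)) > len(best_lexeme):
            best_name, best_lexeme = name, m.group(0)
    return best_name, best_lexeme


# Rule order as quoted in the answer. ALPHA includes digits,
# so VAR : ALPHA+ also matches "10".
conflicting = [
    ("VAR", r"[_a-zA-Z0-9]+"),
    ("NUMBER", r"[0-9]+"),
    ("INT_LITERAL", r"0x[0-9a-fA-F]+|[0-9]+"),
]

# Hypothetical rewrite (an assumption, not from the answer): identifiers
# may not start with a digit and the overlapping NUMBER rule is dropped.
disjoint = [
    ("INT_LITERAL", r"0x[0-9a-fA-F]+|[0-9]+"),
    ("VAR", r"[_a-zA-Z][_a-zA-Z0-9]*"),
]

print(first_token(conflicting, "10"))  # ('VAR', '10')
print(first_token(disjoint, "10"))     # ('INT_LITERAL', '10')
print(first_token(disjoint, "i"))      # ('VAR', 'i')
```

With the conflicting order, all three rules match the two characters of 10, so the first one listed is chosen. Once the rules no longer overlap on digit-only input, 10 can only be an INT_LITERAL, which is what the parser rule expects.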
