
I'm trying to understand how ANTLR grammars work and I've come across a situation where it behaves unexpectedly and I can't explain why or figure out how to fix it.

Here's the example:

root : title '\n' fields EOF;
title : STR;
fields : field_1 field_2;
field_1 : 'a' | 'b' | 'c';
field_2 : 'd' | 'e' | 'f';

STR : [a-z]+;

There are two parts:

  1. A title that is a lowercase string with no special characters
  2. A two-character string representing one of a set of possible configurations

When I test the grammar, here's what happens: first I write the title and, on a new line, give the character for the first field. So far so good. The parse tree looks as I would expect up to this point.

The problem comes up when I add the second field. ANTLR reinterprets the whole line as an instance of STR instead of the concatenation of the two fields that I was expecting.

I do not understand why ANTLR tries to force an unrelated terminal expression when it wasn't specified as an option by the grammar. Shouldn't it know to only look for characters matching the field rules since it is descended from the fields node in the parse tree? What's going on here and how do I write my ANTLR grammars so they don't have this problem?

I've read that ANTLR tries to match the format greedily from the top of the grammar to the bottom, but that doesn't explain what is happening here, because the STR terminal is the very last line in the file. If ANTLR gives special precedence to matching terminals, how do I write the grammar so that it interprets the input properly? As far as I understand, regexes do not work for non-terminals, so it seems that I have to define it the way it is now.

A note of clarification: this is just an example of a possible grammar that I'm trying to make work with the text format as is, so I'm not looking for answers like adding a space between the fields or changing the title to be uppercase.

1 Answer

What I didn't understand before is that an ANTLR-generated parser processes its input in two separate steps:

  1. Scanning the input into a list of tokens using the lexer rules (the rules whose names start with an uppercase letter) and then...
  2. Constructing a parse tree from those tokens using the parser rules (the rules whose names start with a lowercase letter)
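
To make step 1 concrete, here is a toy tokenizer in Python for the grammar from the question. It is only a sketch of the idea, not ANTLR's implementation: the `tokenize` function and the token labels are made up for illustration, and I'm assuming that the quoted literals used in the parser rules act as implicit tokens ordered ahead of STR, which matches the behavior I saw.

```python
import re

# Toy model of the lexing step (NOT ANTLR's actual implementation).
# Rules are (label, regex) pairs in priority order. Assumption: the literals
# '\n' and 'a'..'f' from the parser rules behave like implicit tokens that
# come before the explicit STR rule.
RULES = [("NL", r"\n")] + [(f"'{c}'", c) for c in "abcdef"] + [("STR", r"[a-z]+")]

def tokenize(text, rules=RULES):
    """Split text into (label, lexeme) pairs without consulting the parser."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_label, best_len = None, 0
        for label, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # A strictly longer match wins; on a tie the earlier rule is kept.
            if m and m.end() - pos > best_len:
                best_label, best_len = label, m.end() - pos
        if best_label is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_label, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(tokenize("title\na"))   # [('STR', 'title'), ('NL', '\n'), ("'a'", 'a')]
print(tokenize("title\nad"))  # [('STR', 'title'), ('NL', '\n'), ('STR', 'ad')]
```

In this model the second line of the input never reaches the parser as two separate characters: "ad" is a longer match for STR than "a" is for the literal, so the parser rules for the fields have nothing to match against.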

My problem was that the lexer runs without any knowledge of which parser rule is currently being matched, so there was no way for ANTLR to know I wanted that specific string to be tokenized differently. To fix this, I wrote a new lexer rule for the fields string so that it would be recognized as its own token. The key was putting the FIELDS rule before the STR rule: the lexer prefers the longest match, and when two rules match the same text, the one that appears first in the grammar wins.

root : title '\n' FIELDS EOF;
title : STR;

FIELDS : [a-c] [d-f];
STR : [a-z]+;
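
The effect of the rule order can be sketched in a few lines of Python. Again, this is a toy model under the "longest match wins, ties go to the earlier rule" assumption rather than ANTLR's real code, and `winning_rule` is a hypothetical helper name.

```python
import re

def winning_rule(text, rules):
    """Return (label, lexeme) for the rule that wins at the start of text.

    Toy model: the longest match wins, and a tie goes to the rule listed first.
    """
    best = None
    for label, pattern in rules:
        m = re.match(pattern, text)
        if m and m.end() > 0 and (best is None or m.end() > len(best[1])):
            best = (label, m.group())
    return best

# The two lexer rules from the fixed grammar, in both possible orders.
FIELDS_FIRST = [("FIELDS", r"[a-c][d-f]"), ("STR", r"[a-z]+")]
STR_FIRST = list(reversed(FIELDS_FIRST))

print(winning_rule("ad", FIELDS_FIRST))     # ('FIELDS', 'ad')
print(winning_rule("ad", STR_FIRST))        # ('STR', 'ad')
print(winning_rule("adx", FIELDS_FIRST))    # ('STR', 'adx')
print(winning_rule("title", FIELDS_FIRST))  # ('STR', 'title')
```

One consequence worth keeping in mind: with FIELDS listed first, a two-letter title such as "ad" would also come out as a FIELDS token, because the lexer decides purely on the text and not on where it appears in the input.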

Note: I had to bite the bullet and read the ANTLR Mega Tutorial to figure this out.