1
votes

I'm quite new to ANTLR and I need an explanation about its behaviour when recognising input string. From what I understood, rules staring by an uppercase letter are lexer rules while the ones starting by lowercase letter are parser rules.
I have the following language and I need to write the grammar accordingly:
language
So L should accept all the strings composed by:

  • non empty string with b and/or c
  • d
  • non empty string with a and/or c

This is the first grammar I wrote and I don't have problems with it when recognising the input bbbcbcbcbcdaaacacacaaaa:

start   : (alfa 'd' beta);

alfa    : ('b'|'c') | ('b'|'c') alfa;
beta    : ('a'|'c') | ('a'|'c') beta;

WS      : (' '|'\n'|'\t'|'\r')->skip;

But if I change it like the following one, the same string is not recognised anymore (below you can see the debugger result of ANTLRWorks):

enter image description here

start   : (alfa 'd' beta);

alfa    : ALFA_VALS | ALFA_VALS alfa;
beta    : BETA_VALS | BETA_VALS beta;

ALFA_VALS: ('b'|'c');
BETA_VALS: ('a'|'c');
WS      : (' '|'\n'|'\t'|'\r')->skip;

Furthermore, if I change ALFA_VALS and BETA_VALS to alfa_vals and beta_vals respectively, no problems occur.

Could someone provide me an explanation about this behaviour? Because I couldn't find a specific one addressing this problem.
Thank you so much!

2
I recommend you use ANTLR v4 instead of the old v3 you're using now. - Bart Kiers

2 Answers

2
votes

The ANTLR lexer matches the longest possible sub-sequence of input, or if it can match the same input using multiple lexer rules, it uses the first rule that matches.

The lexer knows no context and no parser state, and decides based solely on the characters on input.

If you define ALFA_VALS and BETA_VALS in lexer this way, the input 'c' will always be matched as ALFA_VALS token.

1
votes

May I suggest a slightly different approach? Your right recursive definition is rather untypical for ANTLR (while very often used in bottom-up parsers). Instead use the simpler expressions as defined in your language:

grammar Example;

start   : (α D β) EOF;

α: (B | C)+;
β: (A | C)+;

A: 'a';
B: 'b';
C: 'c';
D: 'd';

WS: (' '|'\n'|'\t'|'\r') -> skip;

Which gives you this simpler and more natural parse tree:

enter image description here