2
votes

If I have an ANTLR grammar as follows:

grammar Test;
options {
  language = Java;
}

rule : (foo | bar);


foo : FOO ',' FOO;   
bar : BAR; 

FOO: ('0'..'9')+;
BAR: ('a'..'z' | 'A'..'Z' | '0'..'9' | ' ')+;
WHITESPACE: (' ' | '\t')+ { $channel=HIDDEN; };

And I use a test string:

12abc3

this (I believe) is a BAR token which satisfies a bar rule and is parsed as such. Bravo.

However, if I have this string:

12

I receive line 1:2 mismatched input '' expecting ','

This seems rather non-deterministic although I'm sure it's not. I understand I'm already in trouble by having two tokens: FOO and BAR that accept digits. But if the parser is going to succeed or fail it should succeed or fail consistently. In other words, in the first case the first character is a 1 and apparently is being evaluated as a member of the BAR token and thus the parser heads down a successful path. In the second case, the SAME first character is being evaluated as a FOO token and thus the path is doomed to fail despite the fact that the string COULD be a successful bar parse. Why the inconsistency? Or am I missing something more fundamental about ANTLR and/or parsing?

2

2 Answers

3
votes

ANTLR doesn't determine the token type until it sees the first character for the next token(or EOF). ANTLR will also attempt a longest match, which is why you see '12abc3' as BAR and not as FOO BAR. In the second case ANTLR will use FOO for '12' because it is listed first in the grammar.

ANTLR basics

ANTLR lexers

0
votes

In addition to Adam answer, you must realize that the lexer and parser, although defined in the same grammar, are being constructed at different times. First the input source is being tokenized and when that has happened, only then the parser operates on these tokens. The tokens are not created while the parser goes through the source (character stream) to favor a complete match (ie. tokenize "12" as BAR). The fact that "12" is being tokenized as FOO is because FOO comes before the BAR rule and has therefor a higher precedence in case of an equal long match.

In short: ANTLR grammars are not PEG's.