The lexer chooses the wrong Token

Question

Hi, I am new to ANTLR and have a problem that I have not been able to solve over the last few days:

I want to write a grammar that recognizes this text (in reality I want to parse something different, but I simplified it for this question):

100abc
150100
200def

Here each row starts with 3 digits that identify the type of the line (header, content, trailer), then 3 characters follow that are the payload of the line.

I thought I could recognize this with this grammar:

grammar Types;

file : header content trailer;

A : [a-z|A-Z|0-9];
NL: '\n';

header : '100' A A A NL;
content: '150' A A A NL;
trailer: '200' A A A NL;

But this does not work. When the lexer reads the "100" in the second line ("150100"), it reads it as a single token with the value 100 and not as three tokens of type A. So the parser sees a '100' token where it expects an A token.

I am pretty sure that this happens because the lexer wants to match the longest possible text for one token, so it clusters the '1', '0', '0' together. I found no way to solve this. Putting rule A above the parser rule that contains the string literal '100' did not work. Factoring the '100' out into its own token as follows did not work either.

grammar Types;

file : header content trailer;

A : [a-z|A-Z|0-9];
NL: '\n';
HUNDRED: '100';

header :  HUNDRED A A A NL;
content: '150' A A A NL;
trailer: '200' A A A NL;

I also read some other posts like this:

antlr4 mixed fragments in tokens

Lexer, overlapping rule, but want the shorter match

But I do not think that they solve my problem, or at least I don't see how they could help me.

Ivan Kochurkin · Accepted Answer · 2016-12-27T13:41:00

One of your token definitions is incorrect: A : [a-z|A-Z|0-9]; Don't use a vertical bar inside a [] character set, because there it is a literal character and not an "or". A correct definition is: A : [a-zA-Z0-9];. ANTLR 4.6 and later warns about the duplicated | character inside the set.
As I understand it, you have mixed up the concepts of tokens and parser rules. Token (lexer rule) names start with an uppercase letter, while parser rule names start with a lowercase letter. Your header, content and trailer should be tokens, not parser rules. The lexer then matches each whole line as a single token, which is longer than the literal '100' on its own, so the conflict disappears.

So, in my opinion, the corrected grammar is:
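
To see why the vertical bars matter, here is a small sketch in Python. It uses Python's `re` character classes purely as an analogy for ANTLR's [] sets (it is not ANTLR code), on the assumption that both treat `|` inside brackets as an ordinary character. The names `with_pipes` and `without_pipes` are made up for this illustration.

```python
import re

# Analogy only: Python regex character classes standing in for ANTLR [] sets.
with_pipes = re.compile(r"[a-z|A-Z|0-9]")   # set as written in the question
without_pipes = re.compile(r"[a-zA-Z0-9]")  # corrected set

# Inside brackets the bar is just another member of the set.
print(bool(with_pipes.fullmatch("|")))     # True
print(bool(without_pipes.fullmatch("|")))  # False

# Letters and digits are accepted by both sets.
print(bool(with_pipes.fullmatch("x")), bool(without_pipes.fullmatch("7")))  # True True
```

So the original set accepts one character more than intended, namely the bar itself.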

grammar Types;

file : Header Content Trailer;

A : [a-zA-Z0-9];
NL: '\r' '\n'? | '\n' | EOF; // Or leave only one type of newline. 

Header :  '100' A A A NL;
Content: '150' A A A NL;
Trailer: '200' A A A NL;

Your input text will be parsed to (file 100abc\n 150100\n 200def)
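
The difference between the two grammars can be illustrated with a rough simulation of a longest-match lexer. This is a Python sketch, not ANTLR itself: the `tokenize` helper, the rule tables and the regex translations of the lexer rules are assumptions made for this example, and the implicit literal tokens are given the names T__0 to T__2. Both tables use the corrected A set so that only the longest-match effect differs.

```python
import re

def tokenize(text, rules):
    """Longest-match tokenizer: at each position the rule that matches the
    most characters wins; on a tie the rule listed first wins."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

A = r"[a-zA-Z0-9]"
NL = r"(?:\r\n?|\n|\Z)"  # \Z stands in for EOF

# Grammar from the question: the literals in parser rules become tokens.
question_rules = [("T__0", "100"), ("T__1", "150"), ("T__2", "200"),
                  ("A", A), ("NL", r"\n")]

# Grammar from the answer: every line is one token.
answer_rules = [("A", A), ("NL", NL),
                ("Header", "100" + A + "{3}" + NL),
                ("Content", "150" + A + "{3}" + NL),
                ("Trailer", "200" + A + "{3}" + NL)]

text = "100abc\n150100\n200def"
question_tokens = tokenize(text, question_rules)
answer_tokens = tokenize(text, answer_rules)

print([name for name, _ in question_tokens])
print([name for name, _ in answer_tokens])
```

With the question's rules, the payload "100" on the second line comes out as the literal token (T__0 here) instead of three A tokens, which is the behaviour described in the question. With the answer's rules, each line is consumed as one Header, Content or Trailer token, because the whole-line match is always longer than any single-character A.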
