I have the following grammar:

grammar Token;

prog: (expr NL?)+ EOF;

expr: '[' type ']';

type : typeid ':' value;

typeid : 'TXT' | 'ENC' | 'USR';

value: Text | INT;

INT :   '0' | [1-9] [0-9]*;

//WS : [ \t]+;
WS  :   [ \t\n\r]+ -> skip ;
NL:  '\r'? '\n';
Text : ~[\]\[\n\r"]+ ;

The text I need to parse looks like this:

[TXT:look at me!]
[USR:19700]
[TXT:, can I go there?]
[ENC:124124]
[TXT:this is needed for you to go...]

I need to split this text, but I am getting errors when I run grun.bat Token prog -gui -trace -diagnostics:

enter   prog, LT(1)=[
enter   expr, LT(1)=[
consume [@0,0:0='[',<3>,1:0] rule expr
enter   type, LT(1)=TXT:look at me!
enter   typeid, LT(1)=TXT:look at me!
line 1:1 mismatched input 'TXT:look at me!' expecting {'TXT', 'ENC', 'USR'}
... much more ...

What is wrong with my grammar? Please help me!

Text matches way too much. It matches 'TXT' for example. Try making it more specific. - Terence Parr
@TheANTLRGuy but I need to match any text between 'TXT' and ']', how can I make Text more specific for that? - thiagoh
Let Text capture only a single character and keep it as the last rule. That way it will not match identifiers. Wherever you previously used Text, use Text+ instead. Note that it will no longer match whitespace! EDIT: @BartKiers already proposed exactly that! - Onur

1 Answer

You must understand that tokens are not created based on what the parser is trying to match. The lexer runs independently of the parser and tries to match as many characters as possible, so Text swallows 'TXT:look at me!' as a single token before the parser ever asks for 'TXT'. Your Text token should be defined differently.
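
To make the longest-match behaviour concrete, here is a rough Python simulation of how the lexer rules of the original grammar compete. It is only a sketch, not ANTLR itself: the rule table, the SKIPPED set, and the tokenize helper are names made up for this illustration, and it assumes that the literals used in parser rules ('[', 'TXT', ...) act as lexer rules that come before the explicit ones.

```python
import re

# Lexer rules of the original grammar, in priority order (illustrative only).
# Assumption: literals from the parser rules come first, then the named rules.
RULES = [
    ("'['", r"\["),
    ("']'", r"\]"),
    ("':'", r":"),
    ("'TXT'", r"TXT"),
    ("'ENC'", r"ENC"),
    ("'USR'", r"USR"),
    ("INT", r"0|[1-9][0-9]*"),
    ("WS", r"[ \t\n\r]+"),
    ("NL", r"\r?\n"),
    ("Text", r'[^\]\[\n\r"]+'),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in RULES]
SKIPPED = {"WS"}  # WS uses "-> skip", so it never reaches the parser


def tokenize(text):
    """Longest match wins; on a tie the rule listed first wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in COMPILED:
            match = regex.match(text, pos)
            # Strict ">" keeps the earlier rule when two matches tie in length.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name not in SKIPPED:
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


print(tokenize("[TXT:look at me!]"))
# [("'['", '['), ('Text', 'TXT:look at me!'), ("']'", ']')]
```

At position 1 the literal 'TXT' matches 3 characters while Text matches 15, so Text wins. That is the token the trace shows as LT(1)=TXT:look at me!, and it is why the parser never sees a separate 'TXT' token.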

You could turn Text into a parser rule instead and have the lexer match single-character tokens, like this:

grammar Token;

prog   : expr+ EOF;
expr   : '[' type ']';
type   : typeid ':' value;
typeid : 'TXT' | 'ENC' | 'USR';
value  : text | INT;
text   : CHAR+;

INT  : '0' | [1-9] [0-9]*;
WS   : [ \t\n\r]+ -> skip ;
CHAR : ~[\[\]\r\n];
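
As a rough check of what the revised lexer rules produce, the same longest-match idea can be simulated in Python. Again this is only a sketch under assumptions, not ANTLR: RULES, SKIPPED, and tokenize are hypothetical names for this illustration, and the literals from the parser rules are assumed to take priority over the named lexer rules.

```python
import re

# Lexer rules of the revised grammar, in priority order (illustrative only).
RULES = [
    ("'['", r"\["),
    ("']'", r"\]"),
    ("':'", r":"),
    ("'TXT'", r"TXT"),
    ("'ENC'", r"ENC"),
    ("'USR'", r"USR"),
    ("INT", r"0|[1-9][0-9]*"),
    ("WS", r"[ \t\n\r]+"),
    ("CHAR", r"[^\[\]\r\n]"),  # exactly one character
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in RULES]
SKIPPED = {"WS"}  # still skipped, so blanks never reach the parser


def tokenize(text):
    """Longest match wins; on a tie the rule listed first wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in COMPILED:
            match = regex.match(text, pos)
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name not in SKIPPED:
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


tokens = tokenize("[TXT:look at me!]")
print([name for name, _ in tokens[:3]])  # ["'['", "'TXT'", "':'"]
# The characters that the parser rule "text : CHAR+" would see:
print("".join(value for name, value in tokens if name == "CHAR"))  # lookatme!
```

Because CHAR can only ever match one character, 'TXT' now wins at the start of the bracket. The flip side is that a single blank ties between WS and CHAR and WS is listed first, so spaces are skipped and the text rule sees lookatme!. In the same way, a run of digits inside the text would come out as an INT token rather than CHAR tokens. If the blanks matter, the WS rule would need to be handled differently.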