0
votes

I have a grammar like this :

grammar MyGrammar;

field  : f1 (STROKE f2 f3)? ;

f1 : FIELDTEXT+ ;
f2 : 'A' ;
f3 : NUMBER4 ; 

FIELDTEXT    : ~['/'] ;
NUMBER4  : [0-9][0-9][0-9][0-9];
STROKE : '/' ;

This works well enough, and the fields f1, f2 and f3 are all populated correctly.

However, when there is an A to the left of the / (whether or not the optional part is present), the parser additionally reports an error:

extraneous input 'A' expecting {<EOF>, FIELDTEXT, '/'}

Some sample Data:

PHOEN

-> OK.

KLM405/A4046

-> OK.

SAW502A

-> Not OK, 'A' is in f1.

BAW617/A5136

-> Not OK, 'A' is in f1.

I do not understand why 'A' is a problem here (the fields are still populated).

Please give the input you are parsing. (BernardK)
@BernardK - inputs added. (NWS)

3 Answers

1
votes

The problem with SAW502A is that 'A' is a separate token, implicitly defined by the literal in this rule:

f2 : 'A' ;

(It would be the same if it were defined explicitly as a lexer rule.) The token dump for that line shows it:

[@16,19:19='S',<FIELDTEXT>,3:0]
[@17,20:20='A',<'A'>,3:1]
[@18,21:21='W',<FIELDTEXT>,3:2]
[@19,22:22='5',<FIELDTEXT>,3:3]
[@20,23:23='0',<FIELDTEXT>,3:4]
[@21,24:24='2',<FIELDTEXT>,3:5]
[@22,25:25='A',<'A'>,3:6]
[@23,26:26='\n',<FIELDTEXT>,3:7]

The rule f1, however, does not allow anything other than FIELDTEXT. It works with:

f1 : ( FIELDTEXT | 'A' )+ ;

File Question.g4 :

grammar Question;

question
@init {System.out.println("Question last update 2305");}
    : line+ EOF
    ;
line
    : f1 (STROKE f2 f3)? NL
      {System.out.println("f1=" + $f1.text + " f2=" + $f2.text + " f3=" + $f3.text);}
    ;

f1 : ( FIELDTEXT | 'A' )+ ;
f2 : 'A' ;
f3 : NUMBER4 ; 

NUMBER4   : [0-9][0-9][0-9][0-9] ;
STROKE    : '/' ;
NL        : [\r\n]+ ; // -> channel(HIDDEN) ;
WS        : [ \t]+ -> skip ;
FIELDTEXT : ~[/] ;

Input file t.text :

PHOEN
KLM405/A4046
SAW502A
BAW617/A5136

Execution :

$ grun Question question -tokens -diagnostics t.text
[@0,0:0='P',<FIELDTEXT>,1:0]
[@1,1:1='H',<FIELDTEXT>,1:1]
[@2,2:2='O',<FIELDTEXT>,1:2]
[@3,3:3='E',<FIELDTEXT>,1:3]
[@4,4:4='N',<FIELDTEXT>,1:4]
[@5,5:5='\n',<NL>,1:5]
[@6,6:6='K',<FIELDTEXT>,2:0]
[@7,7:7='L',<FIELDTEXT>,2:1]
[@8,8:8='M',<FIELDTEXT>,2:2]
[@9,9:9='4',<FIELDTEXT>,2:3]
[@10,10:10='0',<FIELDTEXT>,2:4]
[@11,11:11='5',<FIELDTEXT>,2:5]
[@12,12:12='/',<'/'>,2:6]
[@13,13:13='A',<'A'>,2:7]
[@14,14:17='4046',<NUMBER4>,2:8]
[@15,18:18='\n',<NL>,2:12]
[@16,19:19='S',<FIELDTEXT>,3:0]
[@17,20:20='A',<'A'>,3:1]
[@18,21:21='W',<FIELDTEXT>,3:2]
[@19,22:22='5',<FIELDTEXT>,3:3]
[@20,23:23='0',<FIELDTEXT>,3:4]
[@21,24:24='2',<FIELDTEXT>,3:5]
[@22,25:25='A',<'A'>,3:6]
[@23,26:26='\n',<NL>,3:7]
[@24,27:27='B',<FIELDTEXT>,4:0]
[@25,28:28='A',<'A'>,4:1]
[@26,29:29='W',<FIELDTEXT>,4:2]
[@27,30:30='6',<FIELDTEXT>,4:3]
[@28,31:31='1',<FIELDTEXT>,4:4]
[@29,32:32='7',<FIELDTEXT>,4:5]
[@30,33:33='/',<'/'>,4:6]
[@31,34:34='A',<'A'>,4:7]
[@32,35:38='5136',<NUMBER4>,4:8]
[@33,39:39='\n',<NL>,4:12]
[@34,40:39='<EOF>',<EOF>,5:0]
Question last update 2305
f1=PHOEN f2=null f3=null
f1=KLM405 f2=A f3=4046
f1=SAW502A f2=null f3=null
f1=BAW617 f2=A f3=5136
1
votes

The input SAW502A will be tokenized as five FIELDTEXT tokens (S, W, 5, 0, 2) and two 'A' tokens (one after the S and one at the end). That's a problem because 'A' tokens aren't allowed at those positions - only FIELDTEXT tokens are. Clearly you intended A to be a FIELDTEXT in this context as well (and only be treated differently in the f2 rule), but the tokenizer doesn't know which kind of token is required by the grammar at a certain point - it only knows the token rules and generates whichever token is the best fit. So whenever it sees an A, it generates an 'A' token.

Note that this also means that whenever it sees four consecutive digits, it generates a NUMBER4 token. So if your input were SAW5023, you'd get an error because of an unexpected NUMBER4 token.
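To make the lexer's point of view concrete, here is a sketch of the question's grammar with the literal written as an explicit lexer rule. The grammar name and the rule name A are my own; the behaviour should be the same as with the implicit 'A' token, since implicit literal tokens take precedence over the explicit lexer rules.

```antlr
grammar MyGrammarExplicit;   // hypothetical name, for illustration only

field : f1 (STROKE f2 f3)? ;

f1 : FIELDTEXT+ ;
f2 : A ;
f3 : NUMBER4 ;

// The lexer runs without any knowledge of the parser rules above.
// It takes the longest match; if several rules match the same length,
// the rule listed first wins.
A         : 'A' ;                      // same length as FIELDTEXT, listed first: every A becomes an A token
NUMBER4   : [0-9][0-9][0-9][0-9] ;     // four digits beat a one-character FIELDTEXT
STROKE    : '/' ;
FIELDTEXT : ~[/] ;                     // any other single character
```

With these rules no A and no run of four digits can ever reach the parser as FIELDTEXT, whatever the parser expects at that position.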

You can fix this by introducing an everythingButAStroke non-terminal rule that can be either a FIELDTEXT, an 'A' or a NUMBER4, and using it in f1. Whenever you add a new token rule, you would have to add that one to everythingButAStroke as well. But that's not a very good solution. For one, it gets less manageable the more token rules you add. For another, you clearly intended f1 to be a list of single characters, but now four-character NUMBER4 tokens would be in there as well, which would be weird and inconsistent.
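For reference, the workaround described above could look roughly like this. It is only a sketch against the question's grammar; the lexer rules stay unchanged.

```antlr
// Replaces f1 from the question's grammar.
f1 : everythingButAStroke+ ;

// Every token type except STROKE has to be listed here by hand,
// and the list has to grow with each new lexer rule.
everythingButAStroke
    : FIELDTEXT
    | 'A'
    | NUMBER4
    ;
```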

It seems to me that your whole field rule could be a single terminal rule (ideally separated into fragments for readability) instead of using non-terminal rules like this. That way you would have no problems with overlapping terminal rules.
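A minimal sketch of that idea, assuming one field per line. The grammar name, the lines rule and the NL token are my own additions and not part of the question.

```antlr
grammar FieldSketch;   // hypothetical name

// Parser side: each line is a single FIELD token.
lines : (FIELD NL)+ EOF ;

// The whole field is one terminal rule, so the characters inside it
// are never matched against any other lexer rule.
FIELD : F1 ('/' F2 F3)? ;

fragment F1 : ~[/\r\n]+ ;                         // everything up to the stroke or the line end
fragment F2 : 'A' ;
fragment F3 : [0-9] [0-9] [0-9] [0-9] ;

NL : [\r\n]+ ;
```

The trade-off is that f1, f2 and f3 no longer exist as separate parse-tree nodes: the application code would have to split the text of the FIELD token at the '/' itself.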

1
votes

In my experience, a negated lexer rule such as ~[/] makes it hard to define other lexer rules, so I prefer to avoid them. In the sample data, a /, if present, is always followed by an A, so the two can be lexed together as a single '/A' token. Hence another solution.

File Question_x.g4 :

grammar Question_x;

question
@init {System.out.println("Question last update 0112");}
    : line+ EOF
    ;

line
    : f1 ( f2s='/A' f3 )? NL
      { String f2 = _localctx.f2s != null ? _localctx.f2s.getText().substring(1) : null;
        System.out.println("f1=" + $f1.text + " f2=" + f2 + " f3=" + $f3.text);}
    ;

f1 : ALPHANUM | NUMBER4 ;
f3 : NUMBER4 ; 

NUMBER4   : [0-9][0-9][0-9][0-9] ;
ALPHANUM  : [a-zA-Z0-9]+ ;
NL        : [\r\n]+ ; // -> channel(HIDDEN) ;
WS        : [ \t]+ -> skip ;

Input file t.text :

PHOEN
KLM405/A4046
SAW502A
BAW617/A5136
SAW5023
1234/A1234

Execution :

$ grun Question_x question -tokens -diagnostics t.text
[@0,0:4='PHOEN',<ALPHANUM>,1:0]
[@1,5:5='\n',<NL>,1:5]
[@2,6:11='KLM405',<ALPHANUM>,2:0]
[@3,12:13='/A',<'/A'>,2:6]
[@4,14:17='4046',<NUMBER4>,2:8]
[@5,18:18='\n',<NL>,2:12]
[@6,19:25='SAW502A',<ALPHANUM>,3:0]
[@7,26:26='\n',<NL>,3:7]
[@8,27:32='BAW617',<ALPHANUM>,4:0]
[@9,33:34='/A',<'/A'>,4:6]
[@10,35:38='5136',<NUMBER4>,4:8]
[@11,39:39='\n',<NL>,4:12]
[@12,40:46='SAW5023',<ALPHANUM>,5:0]
[@13,47:47='\n',<NL>,5:7]
[@14,48:51='1234',<NUMBER4>,6:0]
[@15,52:53='/A',<'/A'>,6:4]
[@16,54:57='1234',<NUMBER4>,6:6]
[@17,58:58='\n',<NL>,6:10]
[@18,59:58='<EOF>',<EOF>,7:0]
Question last update 0112
f1=PHOEN f2=null f3=null
f1=KLM405 f2=A f3=4046
f1=SAW502A f2=null f3=null
f1=BAW617 f2=A f3=5136
f1=SAW5023 f2=null f3=null
f1=1234 f2=A f3=1234