
What is the proper ANTLR4 lexer rule to match arbitrary text up to, but not including, a certain multi-character string in the stream?

E.g. in the CharStream I have:

#integer12314#end
#freetextFoo bar#end

I would like to create a token from Foo bar of token type TEXT.

  • Every entry is closed with the #end tag.
  • TEXT consists of [\u0001-\u007f]*, but let's forget about whitespace interaction for now.
  • TEXT can contain #, #e, #en.

From the CharStream above I would expect the token stream of:

tokenOf(#integer) Integer tokenOf(#end) tokenOf(#freetext) TEXT tokenOf(#end)

Obviously I can try to address this in the following way in the lexer grammar:

TEXT : [\u0001-\u007f]+? '#end' ;

but then the token also contains the end tag, which makes the parser grammar uglier.

(Bonus questions:

  • how to also capture whitespace inside TEXT properly; lexer modes are probably the way to go here;
  • how to avoid interference from Identifier : [a-zA-Z_] [a-zA-Z0-9_$]* and other lexer definitions. )

2 Answers

1
votes

Any attempt to put a + in a lexer rule, such as

TEXT : (NOT_END1 ...)+ ;
fragment NOT_END1 : [\u0001-"$-\u007f] ;

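consumes too much: the lexer keeps matching as long as it can, so the token can run past the closing #end. With input such as ##end, for instance, the pair ## counts as "a # not followed by e", and end is then matched as ordinary characters.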

See Bart's answer here for the use of OTHER : . ;

Using this file input.txt :

#integer12314#end
#freetext x'010203' #end
#freetext##end
#freetext#e#end
#freetext#en e n d # en nd##end
#freetext#e x'040506' #en  #end

where I have inserted the bytes 01 02 03 and 04 05 06 using this editor (hex dump of the resulting file):

00000000  23 69 6e 74 65 67 65 72  31 32 33 31 34 23 65 6e  |#integer12314#en|
00000010  64 0a 23 66 72 65 65 74  65 78 74 01 02 03 23 65  |d.#freetext...#e|
00000020  6e 64 0a 23 66 72 65 65  74 65 78 74 23 23 65 6e  |nd.#freetext##en|
00000030  64 0a 23 66 72 65 65 74  65 78 74 23 65 23 65 6e  |d.#freetext#e#en|
00000040  64 0a 23 66 72 65 65 74  65 78 74 23 65 6e 20 65  |d.#freetext#en e|
00000050  20 6e 20 64 20 23 20 65  6e 20 6e 64 23 23 65 6e  | n d # en nd##en|
00000060  64 0a 23 66 72 65 65 74  65 78 74 23 65 20 04 05  |d.#freetext#e ..|
00000070  06 23 65 6e 20 20 23 65  6e 64 0a                 |.#en  #end.|
0000007b

File Question_any.g4 :

grammar Question_any;

prog
@init {System.out.println("Question_any last update 0901");}
    :   ( line
            {System.out.println("Found line " + $line.source_line + " `" + $line.text + "`");}
        )+ EOF
    ;

line returns [int source_line]
@init {$source_line = getCurrentToken().getLine();}
    :   SHARP_INT INTEGER SHARP_END
    |   SHARP_FREE ANY+ SHARP_END
    ;

SHARP_INT  : '#integer' ;
SHARP_FREE : '#freetext' ;
SHARP_END  : '#end' ;
INTEGER    : [0-9]+ ;
NL         : [\r\n]+ -> skip ;
WS         : [ \t]+ -> channel(HIDDEN) ;

ANY        : [\u0001-\u007f] ; // must be after WS

Execution :

$ grun Question_any prog -tokens input.txt 
[@0,0:7='#integer',<'#integer'>,1:0]
[@1,8:12='12314',<INTEGER>,1:8]
[@2,13:16='#end',<'#end'>,1:13]
[@3,18:26='#freetext',<'#freetext'>,2:0]
[@4,27:27='',<ANY>,2:9]
[@5,28:28='',<ANY>,2:10]
[@6,29:29='',<ANY>,2:11]
[@7,30:33='#end',<'#end'>,2:12]
...
[@35,98:106='#freetext',<'#freetext'>,6:0]
[@36,107:107='#',<ANY>,6:9]
[@37,108:108='e',<ANY>,6:10]
[@38,109:109=' ',<WS>,channel=1,6:11]
[@39,110:110='',<ANY>,6:12]
[@40,111:111='',<ANY>,6:13]
[@41,112:112='',<ANY>,6:14]
[@42,113:113='#',<ANY>,6:15]
[@43,114:114='e',<ANY>,6:16]
[@44,115:115='n',<ANY>,6:17]
[@45,116:117='  ',<WS>,channel=1,6:18]
[@46,118:121='#end',<'#end'>,6:20]
[@47,123:122='<EOF>',<EOF>,7:0]
Question_any last update 0901
Found line 1 `#integer12314#end`
Found line 2 `#freetext#end`
Found line 3 `#freetext##end`
Found line 4 `#freetext#e#end`
Found line 5 `#freetext#en e n d # en nd##end`
Found line 6 `#freetext#e #en  #end`

The special characters (the inserted control bytes) are not printable, so they show up as empty strings in the output above.

0
votes

As a temporary solution I chose to put all the non-ends into the Lexer rule:

TEXT : (NOT_END1 | NOT_END2 | NOT_END3 | NOT_END4)+ ;

fragment NOT_END1 :       [\u0001-"$-\u007f] ;  // # is between " and $ in ASCII
fragment NOT_END2 : '#'   [\u0001-df-\u007f] ;  // e is between d and f
fragment NOT_END3 : '#e'  [\u0001-mo-\u007f] ;  // n is between m and o
fragment NOT_END4 : '#en' [\u0001-ce-\u007f] ;  // d is between c and e

END : '#end' ;

As this is ugly as hell and I feel bad about this shameful act :-), I hope there are more elegant solutions.
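
One candidate for something less ugly is the lexer-mode idea mentioned in the question. The following is only a sketch under assumptions and has not been run against the input above. The grammar name FreeTextLexer, the mode name FREETEXT, and the rule names FREE_END and SHARP_TEXT are made up for illustration. It relies on the lexer preferring the longest match: inside the mode, '#end' (four characters) beats a lone '#' (one character), so the end tag is never swallowed. The price is that the free text arrives as several TEXT tokens instead of one.

```antlr
lexer grammar FreeTextLexer;   // hypothetical name; modes require a lexer grammar

SHARP_INT  : '#integer' ;
SHARP_FREE : '#freetext' -> pushMode(FREETEXT) ;   // switch to free-text mode
END        : '#end' ;
INTEGER    : [0-9]+ ;
WS         : [ \t\r\n]+ -> skip ;

mode FREETEXT;

// Longest match wins, so at "#end" this rule beats SHARP_TEXT below.
FREE_END   : '#end' -> type(END), popMode ;

// A run of any allowed character except '#' (0x23); whitespace is kept.
TEXT       : [\u0001-\u0022\u0024-\u007f]+ ;

// A '#' that does not start "#end" (e.g. in "#e" or "#en") is plain text too.
SHARP_TEXT : '#' -> type(TEXT) ;
```

A parser grammar would import these tokens with options { tokenVocab = FreeTextLexer; } and match the entry as SHARP_FREE TEXT+ END (or TEXT* if empty text is allowed). Because Identifier and the other default-mode rules are not visible inside FREETEXT, they should not interfere with the free text.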