Antlr: how to match everything between the other recognized tokens?

Question

How do I match all of the leftover text between the other tokens in my lexer?

Here's my code:

grammar UserQuery;

expr:  expr AND expr
    | expr OR expr
    | NOT expr
    | TEXT+
    | '(' expr ')'
    ;

OR  :    'OR';
AND :    'AND';
NOT :    'NOT';
LPAREN : '(';
RPAREN : ')';

TEXT: .+?;

When I run the lexer on "xx AND yy", I get these tokens:

x type:TEXT
x type:TEXT
  type:TEXT
AND type:'AND'
  type:TEXT
y type:TEXT
y type:TEXT

This sort of works, except that I don't want each character to be its own token. I'd like to consolidate all of the leftover text into a single TEXT token.

TomServo · Accepted Answer · 2017-08-28T12:02:18

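I don't think this is possible without a delimiter. A greedy catch-all rule such as .+ would match all of your input, including your explicit tokens, on the principle that the longest match wins among lexer rules. The non-greedy .+? you have now has the opposite problem: it stops after a single character, which is why every character comes out as its own TEXT token.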
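The longest-match behaviour can be illustrated outside ANTLR. The following is a minimal Python sketch, not ANTLR itself: `next_token` and `tokenize` are hypothetical helpers that pick the rule with the longest match at each position (the first-listed rule wins ties), with Python regexes standing in for lexer rules. It assumes Python's greedy and non-greedy regex semantics are close enough to the lexer's for this input.

```python
import re

def next_token(rules, text, pos):
    """Return (name, lexeme) for the rule with the longest match at pos.

    Ties go to the rule listed first. Returns None if nothing matches.
    """
    best = None
    for name, pattern in rules:
        m = re.compile(pattern, re.DOTALL).match(text, pos)
        if m and m.end() > pos:
            if best is None or len(m.group()) > len(best[1]):
                best = (name, m.group())
    return best

def tokenize(rules, text):
    """Repeatedly take the longest match until the input is consumed."""
    pos, tokens = 0, []
    while pos < len(text):
        tok = next_token(rules, text, pos)
        if tok is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append(tok)
        pos += len(tok[1])
    return tokens

# Keyword rules from the question, as regexes (stand-ins for lexer rules).
KEYWORDS = [("OR", r"OR"), ("AND", r"AND"), ("NOT", r"NOT")]

# Greedy catch-all: one TEXT token swallows the whole input, AND included.
greedy = tokenize(KEYWORDS + [("TEXT", r".+")], "xx AND yy")
print(greedy)  # [('TEXT', 'xx AND yy')]

# Non-greedy catch-all: TEXT stops after one character every time.
lazy = tokenize(KEYWORDS + [("TEXT", r".+?")], "xx AND yy")
print([name for name, _ in lazy])
# ['TEXT', 'TEXT', 'TEXT', 'AND', 'TEXT', 'TEXT', 'TEXT']
```

The second result has the same shape as the token list in the question: the keyword wins only where it is longer than the one-character TEXT match.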

If you can accept a delimiter around the text, plus a simple whitespace rule to skip the spaces in between, you get tokens like these:

[@0,0:14=''longest token'',<TEXT>,1:0]
[@1,16:18='AND',<'AND'>,1:16]
[@2,20:23=''yy'',<TEXT>,1:20]
[@3,24:23='<EOF>',<EOF>,1:24]

From this grammar:

grammar UserQuery;

expr:  expr AND expr
    | expr OR expr
    | NOT expr
    | TEXT
    | '(' expr ')'
    ;

OR  :    'OR';
AND :    'AND';
NOT :    'NOT';
LPAREN : '(';
RPAREN : ')';

TEXT : '\'' .*? '\'';
WS: [ \t\r\n]+ -> skip;

Using this input:

'longest token' AND 'yy'

It's very similar to the way comments and strings are often handled in programming languages, where there's a starting and ending delimiter and everything in between is tokenized as one big token. Often with comments we'd discard them, but here we keep them as we would a string. Hope this helps.
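To see the delimiter idea in isolation, here is a rough Python sketch that mirrors the lexer rules above with a single regular expression. It is an illustration only, not what ANTLR generates: `TOKEN_RE` and `lex` are hypothetical names, and regex alternation picks the first alternative that matches rather than the longest, which happens to be equivalent here because every rule starts with a different character.

```python
import re

# One named group per lexer rule in the grammar above.
TOKEN_RE = re.compile(
    r"(?P<WS>[ \t\r\n]+)"
    r"|(?P<OR>OR)"
    r"|(?P<AND>AND)"
    r"|(?P<NOT>NOT)"
    r"|(?P<LPAREN>\()"
    r"|(?P<RPAREN>\))"
    r"|(?P<TEXT>'.*?')",   # non-greedy: stops at the next closing quote
    re.DOTALL,
)

def lex(text):
    """Return (type, lexeme, start, stop) tuples, skipping whitespace."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if m.lastgroup != "WS":  # same effect as '-> skip'
            tokens.append((m.lastgroup, m.group(), m.start(), m.end() - 1))
        pos = m.end()
    return tokens

for tok in lex("'longest token' AND 'yy'"):
    print(tok)
# ('TEXT', "'longest token'", 0, 14)
# ('AND', 'AND', 16, 18)
# ('TEXT', "'yy'", 20, 23)
```

The start and stop offsets line up with the ANTLR token dump above. Here the non-greedy match is useful because the closing quote gives it something to stop at.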
