ANTLR4 RegEx lexer modes

Question

I am working on a Regx parser for RegEx inside XSD. My previous problem was descrived here: ANTLR4 parsing RegEx

I have split the Lexer and Parser since than. Now I have a problem parsing parantheses inside brackets. They should be treated as characters inside the brackets and as grouping tokens outside. This is my lexer grammar:

lexer grammar RegExLexer;

Char    : ALPHA ;
Int     : DIGIT ;

LBrack  : '[' ;//-> pushMode(modeRange) ;
RBrack  : ']' ;//-> popMode ;
LBrace  : '(' ;
RBrace  : ')' ;
Semi    : ';' ;
Comma   : ',' ;
Asterisk: '*' ;
Plus    : '+' ;
Dot     : '.' ;
Dash    : '-' ;
Question: '?' ;
LCBrace : '{' ;
RCBrace : '}' ;
Pipe    : '|' ;
Esc     : '\\' ;

WS : [ \t\r\n]+ -> skip ;

fragment DIGIT : [0-9] ;
fragment ALPHA : [a-zA-Z] ;

And here is the example:

[0-9a-z()]+

I feel like i should use modes on brackets to change the behaviour of ALPHA fragment. If I copy the fragment, I get an error saying I can't have the declaration twice. I have read the reference about this and I still don't get what i should do.

How do I implement the modes?

What about CharSet : '[''^'?']'?'-'?([^]\\-]'-'[^]\\-]|[^]\\-]|ESCAPE_SEQUENCE)*'-'?']' — programmerjake
@programmerjake I too have no idea what you're talking about :) — Bart Kiers
@Bart for instance, it would match and return as a token [^-a-zA-Z_] — programmerjake

DBug DBug · Accepted Answer · 2015-12-11T18:01:52

You're going to have to handle this in the parser, not the lexer. When lexer sees a '(', it will return token LBrace. For lexer, there is no context as to where token is seen. It simply carves up the input into tokens. You will have to define parse rules and when processing parse tree, you can then determine was the LBrace inside brackets or not.

ANTLR4 RegEx lexer modes

2 Answers