3
votes

I am working on a regex parser for the regular expressions used inside XSD. My previous problem was described here: ANTLR4 parsing RegEx

I have split the lexer and parser since then. Now I have a problem parsing parentheses inside brackets. They should be treated as characters inside the brackets and as grouping tokens outside. This is my lexer grammar:

lexer grammar RegExLexer;

Char    : ALPHA ;
Int     : DIGIT ;

LBrack  : '[' ;//-> pushMode(modeRange) ;
RBrack  : ']' ;//-> popMode ;
LBrace  : '(' ;
RBrace  : ')' ;
Semi    : ';' ;
Comma   : ',' ;
Asterisk: '*' ;
Plus    : '+' ;
Dot     : '.' ;
Dash    : '-' ;
Question: '?' ;
LCBrace : '{' ;
RCBrace : '}' ;
Pipe    : '|' ;
Esc     : '\\' ;

WS : [ \t\r\n]+ -> skip ;

fragment DIGIT : [0-9] ;
fragment ALPHA : [a-zA-Z] ;

And here is the example:

[0-9a-z()]+

I feel like I should use modes on brackets to change the behaviour of the ALPHA fragment. If I copy the fragment, I get an error saying I can't declare it twice. I have read the reference about this and I still don't get what I should do.

How do I implement the modes?

2
What about CharSet : '[''^'?']'?'-'?([^]\\-]'-'[^]\\-]|[^]\\-]|ESCAPE_SEQUENCE)*'-'?']' - programmerjake
I just don't know what you meant and how to use this. - user1941235
It's an ANTLR token that matches a whole character class. - programmerjake
@programmerjake I too have no idea what you're talking about :) - Bart Kiers
@Bart for instance, it would match [^-a-zA-Z_] and return it as a single token. - programmerjake

2 Answers

2
votes

You're going to have to handle this in the parser, not the lexer. When the lexer sees a '(', it returns the token LBrace. With the grammar as written, the lexer has no context about where a token appears; it simply carves the input up into tokens. You will have to define parser rules, and then, when processing the parse tree, determine whether a given LBrace was inside brackets or not.
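
To make the idea concrete, here is a minimal sketch in plain Java, with no ANTLR involved. It assumes every character is its own token, as in the question's lexer, and it ignores escapes. The class name BracketContextSketch and the labels GROUP, LITERAL and OTHER are made up for illustration. In a real parser this decision would more likely live in a parse-tree listener or visitor; the flat loop below only stands in for that walk.

```java
import java.util.ArrayList;
import java.util.List;

class BracketContextSketch {
    // Returns one label per character:
    //   GROUP   - a '(' or ')' outside brackets (grouping token)
    //   LITERAL - a '(' or ')' inside [...] (plain character)
    //   OTHER   - everything else, including the brackets themselves
    static List<String> labelParens(String regex) {
        List<String> labels = new ArrayList<>();
        boolean insideBrackets = false;
        for (char c : regex.toCharArray()) {
            if (c == '[' && !insideBrackets) {
                insideBrackets = true;
                labels.add("OTHER");
            } else if (c == ']' && insideBrackets) {
                insideBrackets = false;
                labels.add("OTHER");
            } else if (c == '(' || c == ')') {
                // Same token either way; only the surrounding context differs.
                labels.add(insideBrackets ? "LITERAL" : "GROUP");
            } else {
                labels.add("OTHER");
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // prints [GROUP, OTHER, LITERAL, LITERAL, OTHER, GROUP]
        System.out.println(labelParens("([()])"));
    }
}
```

The point is that the token type never changes; the meaning is assigned afterwards from the position of the token relative to the brackets.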

4
votes

Here's a quick demo of how it is possible to create a context sensitive lexer using ANTLR4's lexical-modes:

lexer grammar RegexLexer;

START_CHAR_CLASS
 : '[' -> pushMode(CharClass)
 ;

START_GROUP
 : '('
 ;

END_GROUP
 : ')'
 ;

PLAIN_ATOM
 : ~[()\[\]]
 ;

mode CharClass;

END_CHAR_CLASS
 : ']' -> popMode
 ;

CHAR_CLASS_ATOM
 : ~[\r\n\\\]]
 | '\\' .
 ;

After generating the lexer, you can use the following class to test it:

import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.Token;

public class Main {
    public static void main(String[] args) {
        RegexLexer lexer = new RegexLexer(new ANTLRInputStream("([()\\]])"));
        for (Token token : lexer.getAllTokens()) {
            System.out.printf("%-20s %s\n", RegexLexer.VOCABULARY.getSymbolicName(token.getType()), token.getText());
        }
    }
}

And if you run this Main class, the following will be printed to your console:

START_GROUP          (
START_CHAR_CLASS     [
CHAR_CLASS_ATOM      (
CHAR_CLASS_ATOM      )
CHAR_CLASS_ATOM      \]
END_CHAR_CLASS       ]
END_GROUP            )

As you can see, the ( and ) are tokenized differently outside the character class than they are inside it.
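
If it helps to see what pushMode and popMode are doing, the following is a rough hand-written imitation of the grammar above in plain Java. It is not the code ANTLR generates, and the class name ModeStackSketch and the "TYPE text" string format are invented for this sketch. It keeps a stack of modes and picks the rule set from whatever mode is on top. Input that no rule would match (for example a ']' in the default mode) is rejected with an exception here, whereas the ANTLR lexer would report a token recognition error.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class ModeStackSketch {
    enum Mode { DEFAULT, CHAR_CLASS }

    // Returns tokens as "TYPE text" strings, using the rule names from the grammar.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Deque<Mode> modes = new ArrayDeque<>();
        modes.push(Mode.DEFAULT);
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (modes.peek() == Mode.DEFAULT) {
                if (c == '[') {
                    tokens.add("START_CHAR_CLASS [");
                    modes.push(Mode.CHAR_CLASS);   // like -> pushMode(CharClass)
                } else if (c == '(') {
                    tokens.add("START_GROUP (");
                } else if (c == ')') {
                    tokens.add("END_GROUP )");
                } else if (c == ']') {
                    throw new IllegalArgumentException("no rule matches ']' at index " + i);
                } else {
                    tokens.add("PLAIN_ATOM " + c);
                }
                i++;
            } else {
                if (c == ']') {
                    tokens.add("END_CHAR_CLASS ]");
                    modes.pop();                   // like -> popMode
                    i++;
                } else if (c == '\\' && i + 1 < input.length()) {
                    // Backslash plus any one character forms a single atom.
                    tokens.add("CHAR_CLASS_ATOM " + input.substring(i, i + 2));
                    i += 2;
                } else if (c == '\\' || c == '\r' || c == '\n') {
                    throw new IllegalArgumentException("no rule matches input at index " + i);
                } else {
                    tokens.add("CHAR_CLASS_ATOM " + c);
                    i++;
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The example from the question.
        for (String token : tokenize("[0-9a-z()]+")) {
            System.out.println(token);
        }
    }
}
```

For the question's input [0-9a-z()]+ this yields START_CHAR_CLASS, nine CHAR_CLASS_ATOM tokens (including one each for ( and )), END_CHAR_CLASS, and finally PLAIN_ATOM for the +.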