How to make ANTLR4 grammar support escaping characters?

Question

I have the following ANTLR4 grammar to interpret regular expressions.

// Regular Expression Grammar.
grammar RegExpr;

program  : expr EOF # Root
        ;
expr     : TERM # TermNode
        | expr '?' # OptionalNode
        | '(' expr ')' # OrdinaryNode
        | expr expr # ConcatNode
        | expr '|' expr # OrNode
        ;
ESC      : '\\' . ;
TERM     : ([a-zA-Z0-9,.*^+\-&'":><#![\]] | ESC)+ ;
WS       : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines

However, when I try to parse the string literal '\\(' in Java, I get

line 1:0 no viable alternative at input '\('

I want to treat any character prefixed with '\\' as a terminal. For example, '\\(', '\\)', '\\\\', '\\X' should all be treated as terminals.

In the end, I want to parse '\(a.(b|c)\)' as

'\(a.' (b|c) '\)'

which represents '\(a.b\)' and '\(a.c\)'. Then I can remove all '\'s to get '(a.b)' and '(a.c)'.

Can anyone please point out why the above grammar gives errors on '\\(' and '\(a.(b|c)\)'?

Thanks!

Raven Raven · Accepted Answer · 2018-01-10T10:44:06

The original question already has an answer (using fragment); however, I think there is still a lack of understanding. So here's an explanation:

In ANTLR, the lexer tries all lexer rules at the current input position and picks the one that matches the longest character sequence. If several rules match the same number of characters, the rule specified first in the grammar wins. A token is created for the winning rule and the process starts all over again at the next unmatched character.

In your example ESC is specified before TERM. For the input \( both rules match exactly two characters, so the tie goes to ESC and the lexer emits an ESC token, which no parser rule accepts. The same happens for the trailing \) in \(a.(b|c)\): TERM wins only where it can match more characters than ESC, as it does for the leading \(a. in that input.

By defining ESC to be a fragment you are telling ANTLR that ESC isn't a lexer rule by itself. Therefore it is never matched against the input on its own and never produces a token. Fragments are just reusable parts that can be used to assemble actual lexer rules, so TERM becomes the first lexer rule declared in your grammar and the only one that can match an escape sequence.
In fact, the only advantage of fragments is that they save repetition when multiple lexer rules contain the same sequence (e.g. '\\' .). Instead of writing that sequence out every time, you define it once as a fragment. So you can think of a fragment as a sort of variable holding a sequence that can be inserted into lexer rules.
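
For reference, here is a minimal sketch of the grammar from the question with ESC turned into a fragment. It assumes that this is the only change needed; every other rule is copied unchanged from the question, and I have not run it against inputs other than the ones discussed here.

```
// Regular Expression Grammar.
grammar RegExpr;

program  : expr EOF # Root
        ;
expr     : TERM # TermNode
        | expr '?' # OptionalNode
        | '(' expr ')' # OrdinaryNode
        | expr expr # ConcatNode
        | expr '|' expr # OrNode
        ;

// fragment: ESC can only be used inside other lexer rules
// and never produces a token of its own.
fragment ESC : '\\' . ;

// TERM is now the only rule that can match an escape sequence.
TERM     : ([a-zA-Z0-9,.*^+\-&'":><#![\]] | ESC)+ ;
WS       : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
```

With this version, the input \(a.(b|c)\) should be tokenized as TERM for \(a. followed by '(', TERM, '|', TERM, ')' and a final TERM for \), which is the grouping asked for in the question.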

Long story short: The problem was solved because fragments will not create tokens while normal lexer rules will.
