How to parse grammar of XSD Regex with ANTLR4?

Question

Dear Antlr4 community,

I recently started using ANTLR4 to translate regular expressions from XSD/XML to CVC4. I use the grammar as specified by the W3C, see http://www.w3.org/TR/xmlschema11-2/#regexs . For this question I have simplified this grammar (by removing charClass) to:

grammar XSDRegExp;

regExp            :     branch ( '|' branch )* ;
branch            :     piece* ;
piece             :     atom quantifier? ;
quantifier        :     Quantifiers | '{'quantity'}' ;
quantity          :     quantRange | quantMin | QuantExact ;
quantRange        :     QuantExact ',' QuantExact ;
quantMin          :     QuantExact ',' ;
atom              :     NormalChar | '(' regExp ')' ;       // excluded | charClass  ;

QuantExact        :     [0-9]+ ;
NormalChar        :     ~[.\\?*+{}()|\[\]] ;        
Quantifiers       :     [?*+] ;

Parsing seems to go fine:

input    a(bd){6,7}c{14,15}

However, I get an error message for:

input    12{3,4}

The error is:

line 1:0 mismatched input '12' expecting {<EOF>, '(', '|', NormalChar}

I understand that the Lexer could also see a QuantExact as the first symbol, but since the Parser is only looking for a NormalChar I did not expect this error.

I tried a number of changes:

[1] Swapping the definitions of QuantExact and NormalChar. But swapping introduces an error in the first input:

line 1:6 no viable alternative at input '6'

since in that case '6' is only seen as a NormalChar and NOT as a QuantExact.

[2] Trying to create a context for QuantExact (the curly brackets of quantity), so that the lexer only produces QuantExact tokens inside that limited context. But I failed to find ANTLR4 primitives for this.

Nothing seems to work, so my question is: Can I parse this grammar with ANTLR4? And if so, how?

How confident are you that the . in the definition of NormalChar doesn't need to be escaped (I'm not an ANTLR user, and the documentation is a little vague)? Does the string 12 parse against the grammar as shown? (From your error message, I conjecture 'no'.) Does the string 'abc' parse? — C. M. Sperberg-McQueen
@C.M.Sperberg-McQueen, ANTLR4's character set (character class) behaves as one expects: only the \ and ] need to be escaped, other metacharacters don't. — Bart Kiers
"As one expects"? My expectation would be that . needs escaping. You may have different expectations, of course, but anyone who has used more than two or three regular-expression tools will have learned that expectations are not nearly as useful as documentation. — C. M. Sperberg-McQueen
@C.M.Sperberg-McQueen, yes "as one expects" is IMO applicable here. In most modern programming languages, a DOT does not need to be escaped inside a character class to match the literal '.' instead of matching any char. This applies to Java, any .NET language, Perl, JS, Python, etc. Could you tell me why you expect it to need escaping? In which regex implementation does a DOT need to be escaped to only match the literal '.'? — Bart Kiers
I didn't say the expectation is correct, and I feel no need to attempt to justify the expectation; some people work with a broader range of regex implementations than you mention, and I find your assumption that all users have the same expectations you do a bit off-putting, that's all. No reason that should bother you. — C. M. Sperberg-McQueen

Bart Kiers · Accepted Answer · 2014-06-13T18:01:06

I understand that the Lexer could also see a QuantExact as the first symbol, but since the Parser is only looking for a NormalChar I did not expect this error.

The lexer does not "listen" to the parser: no matter whether the parser is trying to match a NormalChar, the characters 12 will always be matched as a single QuantExact token. The lexer tries to match as many characters as possible, and in case of a tie, it chooses the rule defined first.
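To make the longest-match and first-rule-wins behaviour concrete, here is a minimal Python sketch that imitates it for the token rules in the question. It is only an illustration, not ANTLR's actual implementation: the `tokenize` function and the `LITERAL` rule name are made up for this example, and the literal tokens used in the parser rules are assumed to take precedence over the named lexer rules.

```python
import re

# Token rules in the order of the question's grammar. The literal tokens
# from the parser rules ('|', '{', '}', ',', '(', ')') are lumped together
# under a hypothetical name, LITERAL, and placed first (an assumption that
# mirrors how literals take precedence over the named lexer rules).
RULES = [
    ("LITERAL", re.compile(r"[|{}(),]")),
    ("QuantExact", re.compile(r"[0-9]+")),
    ("NormalChar", re.compile(r"[^.\\?*+{}()|\[\]]")),
    ("Quantifiers", re.compile(r"[?*+]")),
]


def tokenize(text):
    """Split text into (rule name, lexeme) pairs using longest match."""
    tokens = []
    pos = 0
    while pos < len(text):
        best = None
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on a tie the rule defined first is kept.
            if m and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return tokens


for token in tokenize("12{3,4}"):
    print(token)
```

The first token produced for `12{3,4}` is a QuantExact covering both digits, which is exactly the token the parser complains about: it never gets to see two separate NormalChar tokens.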

You could introduce a normalChar rule that matches both a NormalChar and QuantExact and use that rule in your atom:

atom              :     normalChar | '(' regExp ')' ;
normalChar        :     NormalChar | QuantExact ;

Another option would be to let the lexer create single char tokens only, and let the parser glue these together (much like a PEG). Something like this:

regExp            :     branch ( '|' branch )* ;
branch            :     piece* ;
piece             :     atom quantifier? ;
quantifier        :     Quantifiers | '{'quantity'}' ;
quantity          :     quantRange | quantMin | quantExact ;
quantRange        :     quantExact ',' quantExact ;
quantMin          :     quantExact ',' ;
atom              :     normalChar | '(' regExp ')' ; 
normalChar        :     NormalChar | Digit ;
quantExact        :     Digit+ ;

Digit             :     [0-9] ;
NormalChar        :     ~[.\\?*+{}()|\[\]] ;
Quantifiers       :     [?*+] ;
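
The idea of the second grammar (single-character tokens, with the parser gluing digits together only inside the curly brackets) can be sketched as a small hand-written recursive-descent parser in Python. This is a rough illustration of the approach, not what ANTLR generates: the `Parser` class, the `parse` function and the tuple-based tree shape are all invented for this example, and the comma is assumed not to count as a normal character because the literal ',' token takes precedence.

```python
DIGITS = set("0123456789")
QUANTIFIERS = set("?*+")
# Characters that are not normal characters. The comma is included on the
# assumption that the literal ',' token wins over NormalChar.
SPECIAL = set(".\\?*+{}()|[],")


def is_normal(ch):
    # normalChar : NormalChar | Digit  (digits are ordinary atoms here)
    return len(ch) == 1 and ch not in SPECIAL


class Parser:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def peek(self):
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def expect(self, ch):
        if self.peek() != ch:
            raise SyntaxError(f"expected {ch!r} at position {self.pos}")
        self.pos += 1

    def reg_exp(self):  # regExp : branch ( '|' branch )*
        branches = [self.branch()]
        while self.peek() == "|":
            self.pos += 1
            branches.append(self.branch())
        return branches

    def branch(self):  # branch : piece*
        pieces = []
        while self.peek() == "(" or is_normal(self.peek()):
            pieces.append(self.piece())
        return pieces

    def piece(self):  # piece : atom quantifier?
        return (self.atom(), self.quantifier())

    def atom(self):  # atom : normalChar | '(' regExp ')'
        if self.peek() == "(":
            self.pos += 1
            inner = self.reg_exp()
            self.expect(")")
            return ("group", inner)
        ch = self.peek()
        if not is_normal(ch):
            raise SyntaxError(f"unexpected input at position {self.pos}")
        self.pos += 1
        return ch

    def quantifier(self):  # quantifier : Quantifiers | '{' quantity '}'
        ch = self.peek()
        if ch in QUANTIFIERS:
            self.pos += 1
            return ch
        if ch == "{":
            self.pos += 1
            result = self.quantity()
            self.expect("}")
            return result
        return None

    def quantity(self):  # quantity : quantRange | quantMin | quantExact
        low = self.quant_exact()
        if self.peek() == ",":
            self.pos += 1
            if self.peek() in DIGITS:
                return ("range", low, self.quant_exact())
            return ("min", low)
        return ("exact", low)

    def quant_exact(self):  # quantExact : Digit+
        start = self.pos
        while self.peek() in DIGITS:
            self.pos += 1
        if start == self.pos:
            raise SyntaxError(f"expected a digit at position {start}")
        return int(self.text[start:self.pos])


def parse(text):
    parser = Parser(text)
    tree = parser.reg_exp()
    if parser.pos != len(text):
        raise SyntaxError(f"unexpected input at position {parser.pos}")
    return tree


print(parse("12{3,4}"))
```

Because digits only become a number inside `quant_exact`, the input `12{3,4}` is read as the atom `1` followed by the atom `2` with a quantifier, which is the behaviour the question was after.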
