I am trying to parse regular expressions, specifically the following:

[A-Z0-9]{1,20}

The problem is that I don't know how to make the following grammar work, because the Char and Int tokens both match a digit.

grammar RegEx;            

regEx : (character count? )+ ;

character : Char 
          | range ;

range  : '[' (rangeChar|rangeX)+ ']' ;
rangeX : rangeStart '-' rangeEnd ;
rangeChar : Char ;
rangeStart : Char ;
rangeEnd : Char ;

count : '{' (countExact | (countMin ',' countMax) ) '}' ;
countMin : D+ ;
countMax : Int ;
countExact : Int ;

channels {
  COUNT_CHANNEL,
  RANGE_CHANNEL
}

Char : D | C ; 
Int : D+ -> channel(COUNT_CHANNEL) ;

Semicolon : ';' ;
Comma : ',' ;
Asterisk : '*' ;
Plus : '+' ; 
Dot : '.' ;  
Dash : '-' ;
//CourlyBracketL : '{' ;
//CourlyBracketR : '}' ;

WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines, \r (Windows)

fragment D : [0-9] ;
fragment C : [a-zA-Z] ;

I'm new to this and unsure whether I should try lexer modes, channels, some kind of conditional logic, or something else. What is the "normal" approach here? Thanks!

1 Answer

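Putting tokens on any channel other than the default hides them from the normal operation of the parser. In your grammar that means the parser never sees an Int token, so countMax and countExact can never match.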

Try not to combine tokens in the lexer; doing so loses information that can be useful in the parser.
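
To make the two points above concrete, here is a toy Python model of how the lexer rules from the question behave on the input {1,20}. It is only a sketch, not ANTLR itself: it assumes the usual ANTLR lexer behaviour (the longest match wins, and on a tie the rule defined first wins), and the Punct rule plus the tokenize and parser_view helpers are made-up names used for illustration.

```python
import re

# Toy model of the question's lexer rules, listed in definition order.
# Each entry is (token name, pattern, channel). "Punct" is a hypothetical
# stand-in for the literal tokens such as '{' and ','.
RULES = [
    ("Char", re.compile(r"[0-9a-zA-Z]"), "DEFAULT"),    # Char : D | C ;
    ("Int", re.compile(r"[0-9]+"), "COUNT_CHANNEL"),    # Int : D+ -> channel(COUNT_CHANNEL) ;
    ("Punct", re.compile(r"[\[\]{},\-]"), "DEFAULT"),
]

def tokenize(text):
    """Pick the longest match at each position; on a tie, the earliest rule wins."""
    pos, tokens = 0, []
    while pos < len(text):
        best = None
        for name, pattern, channel in RULES:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when two matches have equal length.
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group(), channel)
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append(best)
        pos += len(best[1])
    return tokens

def parser_view(tokens):
    """Keep only default-channel tokens, which is all the parser consumes."""
    return [(name, text) for name, text, channel in tokens if channel == "DEFAULT"]

all_tokens = tokenize("{1,20}")
visible = parser_view(all_tokens)
for token in all_tokens:
    print(token)
print(visible)
```

In this model the single digit 1 comes out as Char (a tie in length, and Char is defined first), while 20 comes out as Int but sits on COUNT_CHANNEL, so it is missing from the list the parser would see.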

Try this:

grammar RegEx;

regEx   : ( value count? )+ ;

value   : alphNum | range ;
range   : LBrack set+ RBrack ;
set     : b=alphNum ( Dash e=alphNum)? ;

count   : LBrace min=num ( Comma max=num )? RBrace ;

alphNum : Char | Int ;
num     : Int+   ;

Char    : ALPHA  ;
Int     : DIGIT  ;

Semi    : ';' ;
Comma   : ',' ;
Star    : '*' ;
Plus    : '+' ;
Dot     : '.' ;
Dash    : '-' ;
LBrace  : '{' ;
RBrace  : '}' ;
LBrack  : '[' ;
RBrack  : ']' ;

WS : [ \t\r\n]+ -> skip ;

fragment DIGIT : [0-9] ;
fragment ALPHA : [a-zA-Z] ;
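
If it helps to see how the parser rules put the single-character tokens back together, here is a rough hand-written Python equivalent of the grammar above. It is a sketch under assumptions, not what ANTLR generates: the lex function, the Parser class and the tuple-based result shape are all hypothetical, and only the tokens needed for this input are modelled.

```python
import string

# Hypothetical hand-written equivalent of the grammar above (not ANTLR output).
PUNCT = {"-": "Dash", ",": "Comma", "{": "LBrace", "}": "RBrace",
         "[": "LBrack", "]": "RBrack"}

def lex(text):
    """One token per character: Char is a letter, Int is a single digit."""
    tokens = []
    for ch in text:
        if ch in " \t\r\n":
            continue                      # WS -> skip
        if ch in PUNCT:
            tokens.append((PUNCT[ch], ch))
        elif ch in string.digits:
            tokens.append(("Int", ch))
        elif ch in string.ascii_letters:
            tokens.append(("Char", ch))
        else:
            raise ValueError(f"unexpected character {ch!r}")
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else None

    def take(self, kind):
        if self.peek() != kind:
            raise SyntaxError(f"expected {kind}, found {self.peek()}")
        text = self.tokens[self.pos][1]
        self.pos += 1
        return text

    def alph_num(self):                   # alphNum : Char | Int ;
        return self.take("Int") if self.peek() == "Int" else self.take("Char")

    def num(self):                        # num : Int+ ;
        digits = self.take("Int")
        while self.peek() == "Int":
            digits += self.take("Int")
        return int(digits)                # the parser rebuilds the number

    def set_(self):                       # set : b=alphNum ( Dash e=alphNum )? ;
        begin = self.alph_num()
        if self.peek() == "Dash":
            self.take("Dash")
            return (begin, self.alph_num())
        return (begin, begin)

    def value(self):                      # value : alphNum | range ;
        if self.peek() != "LBrack":
            single = self.alph_num()
            return [(single, single)]
        self.take("LBrack")               # range : LBrack set+ RBrack ;
        sets = [self.set_()]
        while self.peek() in ("Char", "Int"):
            sets.append(self.set_())
        self.take("RBrack")
        return sets

    def count(self):                      # count : LBrace min=num ( Comma max=num )? RBrace ;
        self.take("LBrace")
        low = high = self.num()
        if self.peek() == "Comma":
            self.take("Comma")
            high = self.num()
        self.take("RBrace")
        return (low, high)

    def reg_ex(self):                     # regEx : ( value count? )+ ;
        items = []
        while True:
            value = self.value()
            count = self.count() if self.peek() == "LBrace" else None
            items.append((value, count))
            if self.peek() is None:
                return items

result = Parser(lex("[A-Z0-9]{1,20}")).reg_ex()
print(result)  # [([('A', 'Z'), ('0', '9')], (1, 20))]
```

Because every digit is its own Int token, the same digit can serve as a range endpoint inside the brackets and as part of a number inside the braces; the parser decides from context. One thing to keep in mind: since whitespace is skipped, num : Int+ also accepts digits separated by spaces, so {1 2} would be read as 12.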