2
votes

Say I have a piece of ANTLR grammar (the lexer part):

fragment LETTER : ('a'..'z' | 'A'..'Z') ;
fragment DIGIT : '0'..'9';
INTEGER : DIGIT+ ;
Ident : LETTER (LETTER | DIGIT)*;
WS : (' ' | '\t' | '\n' | '\r' | '\f')+ {$channel = HIDDEN;};
COMMENT : '//' .* ('\n'|'\r') {$channel = HIDDEN;};

I thought that, since WS consumes all the whitespace between tokens, "x y z" and "xyz" would both be recognized as a single Ident token. But apparently "x y z" is treated as three Ident tokens, so I am confused about how a lexer rule behaves when it encounters whitespace.

More concretely, I have a rule

VARIABLE: ('A'..'Z')+ DIGIT*  ;

I want it to recognize variable identifiers like X3, Y4, XX55, etc. But surprisingly, this rule seems to recognize " X Y", which I find incomprehensible. What do you think is going on?

2 Answers

3
votes

Ident : LETTER (LETTER | DIGIT)*; means that an Ident is a letter followed by zero or more letters or digits. No whitespace!
That's why "x y z" is recognized as three Ident tokens.

1
votes

Although you've put WS on the HIDDEN channel, "x y z" still produces three Ident tokens: the hidden WS tokens are only skipped by parser rules, not inside lexer rules.
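To make this concrete, here is a rough Python simulation of the lexer. It is not ANTLR and not how ANTLR is implemented; the regexes are my own approximations of the INTEGER, Ident and WS rules from the question, and the names tokenize and visible are made up for illustration.

```python
import re

# Approximate regex equivalents of the lexer rules in the question.
# This is an illustrative sketch, not ANTLR-generated code.
TOKEN_SPEC = [
    ("INTEGER", r"[0-9]+"),
    ("Ident", r"[A-Za-z][A-Za-z0-9]*"),
    ("WS", r"[ \t\n\r\f]+"),
]
HIDDEN_RULES = {"WS"}  # rules whose tokens go to the hidden channel
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return (rule, text, channel) for every token, hidden ones included."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"no lexer rule matches at position {pos}")
        rule = match.lastgroup
        channel = "HIDDEN" if rule in HIDDEN_RULES else "DEFAULT"
        tokens.append((rule, match.group(), channel))
        pos = match.end()
    return tokens


def visible(tokens):
    """What a parser would see: only tokens on the default channel."""
    return [(rule, text) for rule, text, channel in tokens if channel == "DEFAULT"]
```

In this sketch, tokenize("x y z") yields five tokens (three Ident, two WS), and visible(...) drops the two WS tokens, leaving three separate Ident tokens. The whitespace is hidden from the parser, but it has already ended each Ident in the lexer. tokenize("xyz") yields a single Ident.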

More concretely, I have a rule

   VARIABLE: ('A'..'Z')+ DIGIT*  ;

I want it to recognize variable identifiers like X3, Y4, XX55, etc. But surprisingly, this rule seems to recognize " X Y", which I find incomprehensible. What do you think is going on?

No, the rule VARIABLE does not match " X Y" (including spaces): you must be doing something wrong.
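A quick way to convince yourself is to check the shape of the rule outside ANTLR. The sketch below uses a Python regex that I am assuming is equivalent to ('A'..'Z')+ DIGIT*; the helper names are hypothetical and none of this is ANTLR API.

```python
import re

# Assumed regex equivalent of: VARIABLE : ('A'..'Z')+ DIGIT* ;
VARIABLE = re.compile(r"[A-Z]+[0-9]*")


def is_single_variable(text):
    """True only if the whole text is exactly one VARIABLE token."""
    return VARIABLE.fullmatch(text) is not None


def variables_in(text):
    """All VARIABLE-shaped pieces found in text, skipping anything else."""
    return VARIABLE.findall(text)
```

Here is_single_variable returns True for "X3", "Y4" and "XX55" but False for " X Y", while variables_in(" X Y") gives ["X", "Y"]. So what probably happens in your case is that the input is split into two VARIABLE tokens with hidden whitespace around them, which can look like one match if you only inspect what the parser receives.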