Why does the ANTLR token rule "IDENT : LETTER (LETTER | DIGIT)*;" not recognize "x y z"?

Question

Say I have a piece of ANTLR grammar (the lexer part):

fragment LETTER : ('a'..'z' | 'A'..'Z') ;
fragment DIGIT : '0'..'9';
INTEGER : DIGIT+ ;
Ident : LETTER (LETTER | DIGIT)*;
WS : (' ' | '\t' | '\n' | '\r' | '\f')+ {$channel = HIDDEN;};
COMMENT : '//' .* ('\n'|'\r') {$channel = HIDDEN;};

I am thinking that, since WS eats all the whitespace between tokens, both "x y z" and "xyz" should be recognized as the same single Ident token. But apparently "x y z" is treated as 3 Ident tokens. So I am really confused about how a lexer rule behaves when whitespace is encountered.

More concretely, I have a rule:

VARIABLE: ('A'..'Z')+ DIGIT* ;

I want it to recognize variable identifiers like X3, Y4, XX55, etc. But surprisingly, this rule also seems to recognize "X Y", which I find totally incomprehensible. What do you think is going on?

True Soft · Accepted Answer · 2011-11-14T20:39:34

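Ident : LETTER (LETTER | DIGIT)*; means that an Ident is a letter followed by zero or more letters or digits. No whitespace! The lexer matches each token against the raw character stream, so a space, which is neither a LETTER nor a DIGIT, ends the current Ident. The WS rule then matches that space as a separate token, and {$channel = HIDDEN;} only hides that token from the parser; it does not glue the neighbouring tokens together.
That's why "x y z" is recognized as 3 Ident tokens, while "xyz" is a single Ident.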
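
To make the tokenization concrete, here is a minimal sketch in Python that imitates the lexer rules from the question with regular expressions. It is not ANTLR and does not reproduce ANTLR's exact matching algorithm; the names TOKEN_SPEC, HIDDEN and tokenize are made up for this illustration, and the COMMENT rule is left out. It only shows that whitespace ends an Ident and becomes its own (hidden) token.

```python
import re

# Hypothetical stand-ins for the lexer rules in the question.
# This is plain Python regex matching, not ANTLR itself.
TOKEN_SPEC = [
    ("INTEGER", r"[0-9]+"),
    ("Ident", r"[A-Za-z][A-Za-z0-9]*"),
    ("WS", r"[ \t\n\r\f]+"),
]

# Token types that would sit on the hidden channel (invisible to the parser).
HIDDEN = {"WS"}

MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(text):
    """Return (all_tokens, visible_tokens) as lists of (type, text) pairs."""
    all_tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        all_tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    # Hiding WS happens after tokenization; it never merges neighbours.
    visible = [tok for tok in all_tokens if tok[0] not in HIDDEN]
    return all_tokens, visible


all_tokens, visible = tokenize("x y z")
print(all_tokens)          # includes the WS tokens between the identifiers
print(visible)             # three separate Ident tokens
print(tokenize("xyz")[1])  # one Ident token
```

In this sketch the space is consumed as a WS token before it is filtered out, so the parser-visible stream for "x y z" still contains three Ident tokens rather than one.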
