Incorrect Result When ANTLR4 Lexer Action Invokes getText()

Question

It seems that getText() in a lexer action does not correctly retrieve the text of the token being matched. Is this normal behaviour? For example, part of my grammar has these rules for parsing a C++-style identifier that supports a \u sequence to embed Unicode characters in the identifier name:

grammar CPPDefine;
cppCompilationUnit: (id_token|ALL_OTHER_SYMBOL)+ EOF;
id_token:IDENTIFIER //{System.out.println($text);}
;
CRLF: '\r'? '\n' -> skip; 
ALL_OTHER_SYMBOL: '\\';
IDENTIFIER: (NONDIGIT (NONDIGIT | DIGIT)*) 
  {System.out.println(getText());}
;
fragment DIGIT: [0-9];
fragment NONDIGIT: [_a-zA-Z]  | UNIVERSAL_CHARACTER_NAME ;
fragment UNIVERSAL_CHARACTER_NAME: ('\\u' HEX_QUAD  | '\\U' HEX_QUAD HEX_QUAD ) ;
fragment HEX_QUAD: [0-9A-Fa-f] [0-9A-Fa-f] [0-9A-Fa-f] [0-9A-Fa-f];

I tested it with this one-line input, which contains an identifier with an invalid Unicode escape sequence:

dkk\uzzzz

The $text in the id_token parser rule action (commented out in the grammar above) produces this correct result:

dkk
uzzzz

That is, the input is interpreted as two identifiers separated by the symbol '\' (the '\' itself is not printed by any parser rule).
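The expected split can be checked with a small hand-written tokenizer that follows the same rules. This is only a sketch of longest-match tokenization for this grammar, not ANTLR's implementation: the class name LongestMatchSketch and all of its methods are hypothetical, and unlike ANTLR it throws on an unrecognized character instead of reporting an error and continuing.

```java
import java.util.ArrayList;
import java.util.List;

public class LongestMatchSketch {
    static boolean isDigit(char c) { return c >= '0' && c <= '9'; }
    static boolean isLetter(char c) {
        return c == '_' || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }
    static boolean isHex(char c) {
        return isDigit(c) || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    // UNIVERSAL_CHARACTER_NAME: '\\u' + 4 hex digits or '\\U' + 8 hex digits.
    // Returns the matched length, or 0 if the escape is incomplete or invalid.
    static int ucnLength(String s, int pos) {
        if (pos + 1 >= s.length() || s.charAt(pos) != '\\') return 0;
        char kind = s.charAt(pos + 1);
        int digits = kind == 'u' ? 4 : kind == 'U' ? 8 : 0;
        if (digits == 0 || pos + 2 + digits > s.length()) return 0;
        for (int i = 0; i < digits; i++) {
            if (!isHex(s.charAt(pos + 2 + i))) return 0;
        }
        return 2 + digits;
    }

    // NONDIGIT: [_a-zA-Z] | UNIVERSAL_CHARACTER_NAME
    static int nondigitLength(String s, int pos) {
        if (pos < s.length() && isLetter(s.charAt(pos))) return 1;
        return ucnLength(s, pos);
    }

    // IDENTIFIER: NONDIGIT (NONDIGIT | DIGIT)*
    // A failed escape contributes nothing, so the match ends before the '\'.
    static int identifierLength(String s, int pos) {
        int len = nondigitLength(s, pos);
        if (len == 0) return 0;
        int end = pos + len;
        while (end < s.length()) {
            int step = isDigit(s.charAt(end)) ? 1 : nondigitLength(s, end);
            if (step == 0) break;
            end += step;
        }
        return end - pos;
    }

    // Returns the text of every non-skipped token, in order.
    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < s.length()) {
            int len = identifierLength(s, pos);
            if (len == 0) {
                char c = s.charAt(pos);
                if (c == '\\') {
                    len = 1;                       // ALL_OTHER_SYMBOL
                } else if (c == '\n') {
                    pos += 1;                      // CRLF -> skip
                    continue;
                } else if (c == '\r' && pos + 1 < s.length() && s.charAt(pos + 1) == '\n') {
                    pos += 2;                      // CRLF -> skip
                    continue;
                } else {
                    throw new IllegalArgumentException("unrecognized character at " + pos);
                }
            }
            tokens.add(s.substring(pos, pos + len));
            pos += len;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("dkk\\uzzzz")); // prints [dkk, \, uzzzz]
    }
}
```

In this sketch the first identifier ends at dkk because the \u escape fails on the first z, so the text dkk\u never forms a token; a valid escape such as \u00e9 would instead be absorbed into the identifier.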

However, getText() in the IDENTIFIER lexer rule action produces this incorrect result:

dkk\u
uzzzz

Why is the IDENTIFIER lexer rule's getText() different from the id_token parser rule's $text? After all, the parser rule contains only this lexer rule.

EDIT:

The issue occurs in ANTLR 4.1 but not in ANTLR 4.2, so it may already have been fixed.

Can you please include a complete sample to reproduce the issue? — Sam Harwell

Sam Harwell · Accepted Answer · 2014-03-22T13:01:59

It's hard to tell based on your example, but my instinct is you are using an old version of ANTLR. I am unable to reproduce this issue in ANTLR 4.2.
