I am new to ANTLR and am working on a parser using ANTLR3, but I am having trouble with the following situation. In the text we parse, the ^ character can occur in several situations. However, there is one special case where '^' is followed by exactly one character. Compare these two cases:
- 'MyText'^M
- ^MyValue
In the first case, '^M' is part of a string, where ^M denotes character 13 (carriage return, hex 0D). In the second case it is not; there '^' is a pointer indicator. The second case is handled by the grammar rules (the ^ character is used in multiple rules).
If I try to solve this with the following tokens, it fails, because '^MyValue' is tokenized as '^M' and 'yValue'. However, I want the ControlChar token to be used only if exactly one character follows the '^'. Otherwise it should not be tokenized as a ControlChar, so that the '^' remains available to the grammar rules.
Pointer
    : '^'
    ;
QuotedString
    : '\'' ('\'\'' | ~('\''))* '\''
    ;
TkIdentifier
    : (Alpha | '_') (Alpha | Digit | '_')*
    ;
ControlString
    : Controlchar (Controlchar)*
    ;
fragment
Controlchar
    : '#' Digitseq
    | '#' '$' Hexdigitseq
    | '^' Alpha
    ;
fragment
Alpha
    : 'a'..'z'
    | 'A'..'Z'
    ;
fragment
Digit
    : '0'..'9'
    ;
So, my question is: how can I instruct ANTLR to match '^' Alpha only if exactly one Alpha follows the '^', and otherwise leave the '^' in the text and tokenize the following Alpha, Digit or '_' characters as a TkIdentifier token?
For example, the lexer should create the following tokens:
^Foo -> Pointer TkIdentifier
^F oo -> ControlChar TkIdentifier
^ F oo -> Pointer TkIdentifier TkIdentifier
Foo^M -> TkIdentifier ControlChar
Foo ^ M -> TkIdentifier Pointer TkIdentifier
Foo ^M -> TkIdentifier ControlChar
Foo^ M -> TkIdentifier Pointer TkIdentifier
'Text'^M -> QuotedString ControlChar
'Text' ^M -> QuotedString ControlChar
'Text' ^ M -> QuotedString Pointer TkIdentifier
^M'Text' -> ControlChar QuotedString
^M 'Text' -> ControlChar QuotedString
^ M'Text' -> Pointer TkIdentifier QuotedString
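To pin down the rule these examples imply, here is a minimal hand-written Java sketch. It is not ANTLR-generated code and not a fix for the grammar; it only illustrates the lookahead I am after: '^' plus one Alpha is a ControlChar only when the character after that Alpha cannot continue an identifier. The class name `ControlCharSketch` and the helper names are made up for illustration, whitespace is assumed to be skipped, each control character is reported as its own ControlChar, and the '#' forms of Controlchar are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone tokenizer; it mirrors the expected token
// sequences listed above, not the behaviour of any ANTLR runtime class.
class ControlCharSketch {

    static boolean isAlpha(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    // Characters that may continue a TkIdentifier: Alpha, Digit or '_'.
    static boolean isIdChar(char c) {
        return isAlpha(c) || (c >= '0' && c <= '9') || c == '_';
    }

    // Returns the token type names for the input, skipping whitespace.
    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        int n = s.length();
        int i = 0;
        while (i < n) {
            char c = s.charAt(i);
            if (c == ' ' || c == '\t' || c == '\r' || c == '\n' || c == '\f') {
                i++; // whitespace is dropped
            } else if (c == '^') {
                // Look two characters ahead before deciding.
                boolean alphaNext = i + 1 < n && isAlpha(s.charAt(i + 1));
                boolean idContinues = i + 2 < n && isIdChar(s.charAt(i + 2));
                if (alphaNext && !idContinues) {
                    tokens.add("ControlChar"); // '^' plus exactly one Alpha
                    i += 2;
                } else {
                    tokens.add("Pointer"); // leave the Alpha for TkIdentifier
                    i += 1;
                }
            } else if (c == '\'') {
                i++; // opening quote
                boolean closed = false;
                while (i < n && !closed) {
                    if (s.charAt(i) != '\'') {
                        i++; // ordinary character inside the string
                    } else if (i + 1 < n && s.charAt(i + 1) == '\'') {
                        i += 2; // doubled quote stays inside the string
                    } else {
                        i++; // closing quote
                        closed = true;
                    }
                }
                if (!closed) {
                    throw new IllegalArgumentException("unterminated string");
                }
                tokens.add("QuotedString");
            } else if (isAlpha(c) || c == '_') {
                i++;
                while (i < n && isIdChar(s.charAt(i))) {
                    i++;
                }
                tokens.add("TkIdentifier");
            } else {
                throw new IllegalArgumentException("unexpected character: " + c);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("^Foo"));      // [Pointer, TkIdentifier]
        System.out.println(tokenize("^F oo"));     // [ControlChar, TkIdentifier]
        System.out.println(tokenize("'Text'^M"));  // [QuotedString, ControlChar]
    }
}
```

What I cannot figure out is how to express this "one Alpha, then no identifier character" lookahead in the ANTLR3 lexer rules themselves.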
1) string := 'MyText'^M, 2) string := 'MyText'^M'MyText2', 3) string := ^M'MyText', 4) string := ^M. We use the following token to remove whitespace: WS : (' '|'\t'|'\r'|'\n'|'\f')+ {$channel=HIDDEN;}
– Laurens
1) foo^M, 2) foo ^ M, 3) foo ^M, 4) ^Mfoo, 5) foo^ M.
– Bart Kiers