ANTLR3: match exactly one character in token

Question

I am new to ANTLR and am working on a parser using ANTLR3, but I am having trouble with the following situation. In the text we parse, the ^ character can occur in multiple situations. However, there is one special case where '^' is followed by exactly one character. This occurs in strings:

'MyText'^M
^MyValue

In the first situation, '^M' is part of a string, where ^M denotes control character 13 (carriage return, hex 0D). In the second it is not; there it is a Pointer indicator. The second situation is captured in the grammar rules (the ^ character is used in multiple rules).

If I solve it with the following tokens, it fails, because '^MyValue' is tokenized into '^M' and 'yValue'. However, I want the ControlChar token to be used only if exactly one character follows the ^. Otherwise the ^ should be left alone, so that it is still available to the grammar rules.

Pointer                 : '^'
                        ;
QuotedString            : '\'' ('\'\'' | ~('\''))* '\''
                        ;
TkIdentifier            : (Alpha | '_') (Alpha | Digit | '_')*
                        ;
ControlString           : Controlchar (Controlchar)*
                        ;
fragment
Controlchar             : '#' Digitseq
                        | '#' '$' Hexdigitseq
                        | '^' Alpha
                        ;
fragment
Alpha                   : 'a'..'z'
                        | 'A'..'Z'
                        ;
fragment
Digit                   : '0'..'9'
                        ;

So my question is: how can I instruct ANTLR to match '^' Alpha only if exactly one Alpha follows the ^, and otherwise leave the '^' on its own and tokenize the following Alpha, Digit or '_' characters as a TkIdentifier token?

For example, the lexer should create the following tokens:

^Foo -> Pointer TkIdentifier
^F oo -> ControlChar TkIdentifier
^ F oo -> Pointer TkIdentifier TkIdentifier
Foo^M -> TkIdentifier ControlChar
Foo ^ M -> TkIdentifier Pointer TkIdentifier
Foo ^M -> TkIdentifier ControlChar
Foo^ M -> TkIdentifier Pointer TkIdentifier

'Text'^M -> QuotedString ControlChar
'Text' ^M -> QuotedString ControlChar
'Text' ^ M -> QuotedString Pointer TkIdentifier
^M'Text' -> ControlChar QuotedString
^M 'Text' -> ControlChar QuotedString
^ M'Text' -> Pointer TkIdentifier QuotedString

True, but it depends, there are 4 cases where ^M can occur: 1) string := 'MyText'^M 2) string := 'MyText'^M'MyText2' 3) string := ^M'MyText' 4) string := ^M. We use the following token to remove whitespaces WS : (' '|'\t'|'\r'|'\n'|'\f')+ {$channel=HIDDEN;} — Laurens
It's not clear to me what your requirements exactly are. Could you edit your original question and add for the following 5 examples what tokens you want your lexer to produce: 1) foo^M, 2) foo ^ M, 3) foo ^M, 4) ^Mfoo, 5) foo^ M. — Bart Kiers
I have edited my original post and I hope it is more clear to you now. I have to use ANTLR3 because the grammar and code I'm editing (it is an existing project) uses ANTLR3 everywhere. — Laurens

Bart Kiers · Accepted Answer · 2020-04-17T09:29:39

In this case, you'll have to use target-specific code inside a predicate in your grammar.

What you'll have to do is this: whenever the lexer stumbles upon a ^, it has to look 2 characters ahead in the stream. If those 2 characters are a word character followed by a non-word character, it creates a Controlchar token that includes the Alpha following the ^. If not, it creates a Pointer from the ^ alone.

For ANTLR3 with Java as the target language, that might look like this:

// Cannot be a fragment now, because we're changing the `type` in certain cases. And because it is
// no fragment any more, it has to come before the `ControlString` rule.
Controlchar
 : '^' ( // Execute the predicate, which looks ahead 2 chars and passes if 
         // these 2 chars are a word and a non-word
         {((char)input.LA(1) + "" + (char)input.LA(2)).matches("\\w\\W")}?=> 
         // If the predicate is true, match a single `Alpha`
         Alpha
       |  // If the predicate failed, change the type of this token to a `Pointer`
         {$type=Pointer;}
       )
 | '#' Digitseq
 | '#' '$' Hexdigitseq
 ;

ControlString
 : Controlchar+
 ;

I ran a quick test:

import org.antlr.runtime.*;

public class Main {

    public static void main(String[] args) throws Exception {

        String[] tests = {
                "^Foo",
                "^F oo",
                "^ F oo",
                "Foo^M",
                "Foo ^ M",
                "Foo ^M",
                "Foo^ M",
                "'Text'^M",
                "'Text' ^M",
                "'Text' ^ M",
                "^M'Text'",
                "^M 'Text'",
                "^ M'Text'",
                "^Q^E^D"
        };

        for (String test : tests) {

            TLexer lexer = new TLexer(new ANTLRStringStream(test));
            CommonTokenStream tokenStream = new CommonTokenStream(lexer);
            tokenStream.fill();

            System.out.printf("\ntest: %-15s tokens: ", test);

            for (Token t : tokenStream.getTokens()) {
                if (t.getType() != Token.EOF) {
                    System.out.printf(" %s", TParser.tokenNames[t.getType()]);
                }
            }
        }
    }
}

which printed:

test: ^Foo            tokens:  Pointer TkIdentifier
test: ^F oo           tokens:  Controlchar TkIdentifier
test: ^ F oo          tokens:  Pointer TkIdentifier TkIdentifier
test: Foo^M           tokens:  TkIdentifier Controlchar
test: Foo ^ M         tokens:  TkIdentifier Pointer TkIdentifier
test: Foo ^M          tokens:  TkIdentifier Controlchar
test: Foo^ M          tokens:  TkIdentifier Pointer TkIdentifier
test: 'Text'^M        tokens:  QuotedString Controlchar
test: 'Text' ^M       tokens:  QuotedString Controlchar
test: 'Text' ^ M      tokens:  QuotedString Pointer TkIdentifier
test: ^M'Text'        tokens:  Controlchar QuotedString
test: ^M 'Text'       tokens:  Controlchar QuotedString
test: ^ M'Text'       tokens:  Pointer TkIdentifier QuotedString
test: ^Q^E^D          tokens:  ControlString

Note that you can also keep your grammar (a little) cleaner by moving the embedded code into the @lexer::members section:

// Place this in the top of your grammar definition
@lexer::members {
  private boolean isControlchar() {
    // TODO 
    //  - check if there are actually 2 chars ahead and not an EOF
    //  - perhaps something else than a regex match here
    return ((char)input.LA(1) + "" + (char)input.LA(2)).matches("\\w\\W");
  }
}

...

Controlchar
 : '^' ( {isControlchar()}?=> Alpha
       | {$type=Pointer;}
       )
 | '#' Digitseq
 | '#' '$' Hexdigitseq
 ;
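
As for the TODO above, here is a minimal sketch of what a regex-free check might look like. It is only an illustration under some assumptions: the class name CaretLookahead and its helper methods are made up, the two parameters stand for the values of input.LA(1) and input.LA(2), and "word character" is narrowed to what the Alpha, Digit and '_' parts of TkIdentifier accept. Note that this is slightly stricter than the regex: \w also accepts a digit or an underscore directly after the ^, which the Alpha that follows the predicate could not match.

```java
// Hypothetical helper, not part of the grammar above: the same decision as
// isControlchar(), but without a regex and with explicit EOF handling.
final class CaretLookahead {

    // ANTLR 3's CharStream.EOF is -1; input.LA(n) returns it past the end.
    static final int EOF = -1;

    // Mirrors the Alpha fragment: ASCII letters only.
    static boolean isAlpha(int c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    // Mirrors the characters a TkIdentifier can continue with.
    static boolean isIdentifierPart(int c) {
        return isAlpha(c) || (c >= '0' && c <= '9') || c == '_';
    }

    // la1 and la2 stand for input.LA(1) and input.LA(2) right after the '^'.
    static boolean isControlchar(int la1, int la2) {
        return isAlpha(la1) && (la2 == EOF || !isIdentifierPart(la2));
    }

    public static void main(String[] args) {
        System.out.println(isControlchar('M', EOF));   // the ^M in "Foo^M"
        System.out.println(isControlchar('F', 'o'));   // the ^F in "^Foo"
        System.out.println(isControlchar('M', '\''));  // the ^M in "^M'Text'"
    }
}
```

Inside @lexer::members, the body of isControlchar() could then delegate to something like CaretLookahead.isControlchar(input.LA(1), input.LA(2)), or the three static methods could be copied into the members section directly.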

Because the type of the token gets changed programmatically, a lexer rule like ControlString : Controlchar+; will also match ^^^ (three Pointer tokens). If possible, you could create a parser rule instead:

controlstring
 : Controlchar+
 ;
