ANTLR4 Java - Is it possible to extract fragments from the lexer at the parser listener point?

Question

I have a Lexer Rule as follows:

PREFIX  : [abcd]'_'; 
EXTRA   : ('xyz' | 'XYZ' );
SUFFIX  : [ab];

TCHAN           :   PREFIX EXTRA? DIGIT+ SUFFIX?;

and a parser rule:

tpin            :   TCHAN
                ;

In the exit_tpin() listener method, is there a syntax that lets me extract the DIGIT component of the token? Right now I can get the ctx.TCHAN() element, but this is a string. I just want the digit portion of TCHAN.

Or should I remove TCHAN as a token and turn it into the parser rule tpin, i.e.:

tpin : PREFIX EXTRA? DIGIT+ SUFFIX?

Where I know how to extract DIGIT from the listener.

My guess is that by the time the token is presented to the parser it is too late to deconstruct it, but I was wondering if some ANTLR gurus out there know of a technique.

If I rewrite my tokenizer, there is a possibility that TCHAN tokens will be missed and lexed as INT/ID tokens instead (I think that's why I ended up parsing as I do).

I can always do some regexp work in the listener method, but that seems like bad form, since I had the individual components earlier. I'm just lazy, and was wondering if a technique other than refactoring the parsing grammar is possible.

BernardK · Accepted Answer · 2017-02-07T00:51:12

In The Definitive ANTLR Reference you can find examples of complex lexers where much of the work is done in the lexer. But when learning ANTLR, I would advise treating the lexer mainly as a way to split the input stream into small tokens, and doing the big work in the parser. In the present case I would write:

grammar Question;

/* extract digit */

question
    :   tpin EOF
    ;

tpin
//  :   PREFIX EXTRA? DIGIT+ SUFFIX?
//      {System.out.println("The only useful information is " + $DIGIT.text);}
    :   PREFIX EXTRA? number SUFFIX?
        {System.out.println("The only useful information is " + $number.text);}
    ;

number
    :   DIGIT+
    ;

PREFIX  : [abcd]'_'; 
EXTRA   : ('xyz' | 'XYZ' );
DIGIT   : [0-9] ;
SUFFIX  : [ab];
WS      : [ \t\r\n]+ -> skip ;

Say the input is d_xyz123456b. With the first version

    :   PREFIX EXTRA? DIGIT+ SUFFIX?

you get

$ grun Question question -tokens data.txt
[@0,0:1='d_',<PREFIX>,1:0]
[@1,2:4='xyz',<EXTRA>,1:2]
[@2,5:5='1',<DIGIT>,1:5]
[@3,6:6='2',<DIGIT>,1:6]
[@4,7:7='3',<DIGIT>,1:7]
[@5,8:8='4',<DIGIT>,1:8]
[@6,9:9='5',<DIGIT>,1:9]
[@7,10:10='6',<DIGIT>,1:10]
[@8,11:11='b',<SUFFIX>,1:11]
[@9,13:12='<EOF>',<EOF>,2:0]
The only useful information is 6

Because the parsing of DIGIT+ translates to a loop which reuses DIGIT

    setState(12); 
    _errHandler.sync(this);
    _la = _input.LA(1);
    do {
        {
        {
        setState(11);
        ((TpinContext)_localctx).DIGIT = match(DIGIT);
        }
        }
        setState(14); 
        _errHandler.sync(this);
        _la = _input.LA(1);
    } while ( _la==DIGIT );

and since $DIGIT.text translates to ((TpinContext)_localctx).DIGIT.getText(), only the last digit is retained. That's why I define a subrule number

:   PREFIX EXTRA? number SUFFIX?

which makes it easy to capture the value:

[@0,0:1='d_',<PREFIX>,1:0]
[@1,2:4='xyz',<EXTRA>,1:2]
[@2,5:5='1',<DIGIT>,1:5]
[@3,6:6='2',<DIGIT>,1:6]
[@4,7:7='3',<DIGIT>,1:7]
[@5,8:8='4',<DIGIT>,1:8]
[@6,9:9='5',<DIGIT>,1:9]
[@7,10:10='6',<DIGIT>,1:10]
[@8,11:11='b',<SUFFIX>,1:11]
[@9,13:12='<EOF>',<EOF>,2:0]
The only useful information is 123456
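
The difference between the two outputs can be reproduced outside ANTLR. What follows is only a plain-Java analogy, not generated parser code: the class and method names are hypothetical. One method overwrites a single variable on each loop iteration, the way the generated loop reassigns the DIGIT field; the other keeps every match, which is roughly what you get from the text of the number subrule.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java analogy only; LabelOverwriteDemo is a hypothetical name,
// not something ANTLR generates.
class LabelOverwriteDemo {

    // Mimics the single Token field behind $DIGIT:
    // every match replaces the previous one.
    static String lastMatched(String digits) {
        String digit = null;
        for (char c : digits.toCharArray()) {
            digit = String.valueOf(c); // overwritten on each iteration
        }
        return digit;
    }

    // Mimics keeping every match, as the text of a subrule does.
    static String allMatched(String digits) {
        List<String> matches = new ArrayList<>();
        for (char c : digits.toCharArray()) {
            matches.add(String.valueOf(c)); // accumulated, nothing is lost
        }
        return String.join("", matches);
    }

    public static void main(String[] args) {
        System.out.println(lastMatched("123456")); // prints 6
        System.out.println(allMatched("123456"));  // prints 123456
    }
}
```

If you would rather not add a subrule, ANTLR 4 also has list labels (for example ds+=DIGIT), which collect the matched tokens in a List<Token>. I have not shown that variant here.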

You can even make it simpler:

tpin
    :   PREFIX EXTRA? INT SUFFIX?
        {System.out.println("The only useful information is " + $INT.text);}
    ;

PREFIX  : [abcd]'_'; 
EXTRA   : ('xyz' | 'XYZ' );
INT     : [0-9]+ ;
SUFFIX  : [ab];
WS      : [ \t\r\n]+ -> skip ;

$ grun Question question -tokens data.txt
[@0,0:1='d_',<PREFIX>,1:0]
[@1,2:4='xyz',<EXTRA>,1:2]
[@2,5:10='123456',<INT>,1:5]
[@3,11:11='b',<SUFFIX>,1:11]
[@4,13:12='<EOF>',<EOF>,2:0]
The only useful information is 123456

In the listener you have direct access to these values through the rule context TpinContext:

public static class TpinContext extends ParserRuleContext {
    public Token INT;
    public TerminalNode PREFIX() { return getToken(QuestionParser.PREFIX, 0); }
    public TerminalNode INT() { return getToken(QuestionParser.INT, 0); }
    public TerminalNode EXTRA() { return getToken(QuestionParser.EXTRA, 0); }
    public TerminalNode SUFFIX() { return getToken(QuestionParser.SUFFIX, 0); }
    // ... remaining generated members omitted
}
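
If TCHAN has to stay a single lexer token (the question mentions a risk of clashes with INT/ID), the regexp fallback the question describes could look roughly like the sketch below. It is not part of the grammar above. The class name TchanDigits and the method extractDigits are hypothetical, and the pattern assumes DIGIT is [0-9], as in my grammar.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; assumes DIGIT is [0-9].
class TchanDigits {

    // Mirrors TCHAN : PREFIX EXTRA? DIGIT+ SUFFIX? with the digits in group 1.
    private static final Pattern TCHAN =
            Pattern.compile("[abcd]_(?:xyz|XYZ)?([0-9]+)[ab]?");

    // Returns the digit portion of a TCHAN token text.
    static String extractDigits(String tokenText) {
        Matcher m = TCHAN.matcher(tokenText);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a TCHAN token: " + tokenText);
        }
        return m.group(1);
    }

    public static void main(String[] args) {
        System.out.println(extractDigits("d_xyz123456b")); // prints 123456
    }
}
```

In the listener this would be called with the token text, e.g. TchanDigits.extractDigits(ctx.TCHAN().getText()). The drawback is that the pattern duplicates the lexer rule, so the two have to be kept in sync by hand, which is why I prefer splitting the token in the grammar.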
