
When a rule matches in ANTLR4 and you get the text of that rule, whitespace has commonly already been stripped out by a lexer rule such as:

WS: [ \n\t\r]+ -> skip;

Is it possible to ask in a parse tree visitor "Did this rule skip over any whitespace?"

E.g.

WS: [ \n\t\r]+ -> skip;
ALPHA: [a-z];
NUMERIC: [0-9];

myrule: (ALPHA | NUMERIC)+;

Then in the visitor (I'm using C++):

antlrcpp::Any MyVisitor::visitMyrule(dlParser::MyruleContext *ctx) {
    if (ctx->didSkipSomeWhitespace()) {
        /* There was whitespace */
    } else {
        /* There was no whitespace */
    }
    return false;
}

So:

f56fhj => no whitespace
o9f g66ff o => whitespace

I've tried getting the start/stop indices of the rule's tokens so that I can compare the text length against the number of input characters that went into it. However, the stop token is not always available, and when it is, the values don't line up with the indices I expect. There also doesn't appear to be a simple way to access the original input characters that formed the token.


1 Answer


In that case, you shouldn't skip these whitespace tokens: skipped tokens are discarded by the lexer, so the parser has no knowledge of them. Instead, put them on a different channel (HIDDEN, for example). The parser ignores tokens on the HIDDEN channel, but they remain in the token stream and can be accessed from your code.

A quick demo in Java (I don't have C++ running):

grammar IntList;

list
 : '[' ( list_item ( ',' list_item )* )? ']' EOF
 ;

list_item
 : INT
 ;

INT
 : '0'
 | [1-9] [0-9]*
 ;

SPACES
 : [ \t\f\r\n] -> channel(HIDDEN)
 ;

Running the class:

import org.antlr.v4.runtime.*;

public class Main {

  public static void main(String[] args) {

    String source = "[1,    2,3,\t4,5]";

    IntListLexer lexer = new IntListLexer(CharStreams.fromString(source));
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    IntListParser parser = new IntListParser(tokens);

    new SpaceInspectionVisitor(tokens).visit(parser.list());
  }
}

class SpaceInspectionVisitor extends IntListBaseVisitor<Object> {

  private final CommonTokenStream tokens;

  SpaceInspectionVisitor(CommonTokenStream tokens) {
    this.tokens = tokens;
  }

  @Override
  public Object visitList_item(IntListParser.List_itemContext ctx) {
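    // Token directly before this rule's first token, regardless of channel.
    // The index is never -1 here because '[' always precedes the first list_item.
    Token previous = tokens.get(ctx.start.getTokenIndex() - 1);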
    System.out.printf("token: '%s', previous == SPACES: %s\n", ctx.getText(), previous.getType() == IntListLexer.SPACES);
    return null;
  }
}

will print the following to your console:

token: '1', previous == SPACES: false
token: '2', previous == SPACES: true
token: '3', previous == SPACES: false
token: '4', previous == SPACES: true
token: '5', previous == SPACES: false
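
The demo above only looks at the token just before a rule. For the original question (was there whitespace anywhere inside the rule?), the same idea can be extended: walk the token indices between the rule's first and last token and check whether any of them is off the default channel. The following is a minimal stand-alone sketch of that scan. It does not use the ANTLR runtime: `Tok`, `lex` and `didSkipWhitespace` are hypothetical names made up for illustration, and the toy lexer only mimics the question's grammar with whitespace sent to a hidden channel.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone model of a token stream with channels; NOT the ANTLR runtime.
final class HiddenTokenCheck {

    static final int DEFAULT_CHANNEL = 0;
    static final int HIDDEN_CHANNEL = 1;

    // Hypothetical stand-in for a token: only text and channel matter here.
    static final class Tok {
        final String text;
        final int channel;

        Tok(String text, int channel) {
            this.text = text;
            this.channel = channel;
        }
    }

    // Toy lexer: each run of whitespace becomes one hidden-channel token,
    // every other character becomes its own default-channel token.
    static List<Tok> lex(String input) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            int start = i;
            if (Character.isWhitespace(input.charAt(i))) {
                while (i < input.length() && Character.isWhitespace(input.charAt(i))) {
                    i++;
                }
                out.add(new Tok(input.substring(start, i), HIDDEN_CHANNEL));
            } else {
                i++;
                out.add(new Tok(input.substring(start, i), DEFAULT_CHANNEL));
            }
        }
        return out;
    }

    // True if any token with index in [startIndex, stopIndex] is off the default channel.
    static boolean didSkipWhitespace(List<Tok> tokens, int startIndex, int stopIndex) {
        for (int i = startIndex; i <= stopIndex; i++) {
            if (tokens.get(i).channel != DEFAULT_CHANNEL) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"f56fhj", "o9f g66ff o"}) {
            List<Tok> tokens = lex(s);
            System.out.println(s + " => " + didSkipWhitespace(tokens, 0, tokens.size() - 1));
        }
    }
}
```

In a real visitor, the bounds would presumably come from `ctx.getStart().getTokenIndex()` and `ctx.getStop().getTokenIndex()`, and the channel test from `tokens.get(i).getChannel()` on the `CommonTokenStream`. Since a rule's start and stop tokens are on the default channel, whitespace before or after the rule falls outside that range. `getStop()` can be null for a rule that matched nothing, so that case needs a guard.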