Underscores seen as white spaces. Is it normal?

Question

In my grammar, I have this for white spaces:

WS:
    (' '|'\r'|'\t'|'\n') -> skip
;

However, the parser does not choke if I put an underscore instead of a space.

My-first-module_DEFINITIONS_::=

is recognized as

My-first-module DEFINITIONS ::=

Is there an option I have to set somewhere in the lexer?

Thanks

Here is the reduced grammar that helps reproduce what I see

grammar ASN;

/*--------------------- Module definition -------------------------------------------*/

/* ModuleDefinition (see 13 in ITU-T X.680 (08/2015)) */
moduleDefinition:  
    moduleIdentifier
    DEFINITIONS_LITERAL
    ASSIGN
    BEGIN_LITERAL
    END_LITERAL
;

moduleIdentifier: 
    UCASE_ID 
;



/*--------------------- LITERAL -----------------------------------------------------*/

DEFINITIONS_LITERAL:
    'DEFINITIONS'
;

BEGIN_LITERAL:
    'BEGIN'
;

END_LITERAL:
    'END'
;

ASSIGN:
    '::='
;

UCASE_ID:
    ('A'..'Z') ('-'('a'..'z'|'A'..'Z'|'0'..'9')|('a'..'z'|'A'..'Z'|'0'..'9'))* 
;


/* white-space (see 12.1.6 in ITU-T X.680 (08/2015)) */
WS:
    (' '|'\r'|'\t'|'\n') -> skip
;

and the example that should not be accepted by the parser:

My-first-module_DEFINITIONS_::= 
BEGIN 

END

EDIT: I realize my problem comes from how I run the test: I use JUnit and only check the number of syntax errors found by the parser, and that count does not include errors reported by the lexer. Here is the code, including the listener from Bart's answer, that makes the test fail if the lexer has issues:

// load test data
InputStream inStream = getClass().getClassLoader().getResourceAsStream(resourceName);

if (inStream == null) {
    throw new RuntimeException("Resource not found: " + resourceName);
}

// create a CharStream that reads from the test resource
CharStream input = new ANTLRInputStream(inStream);

// create a lexer that feeds off of input CharStream
ASNLexer lexer = new ASNLexer(input);
lexer.addErrorListener(new BaseErrorListener() {
    @Override
    public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line, int charPositionInLine, String msg, RecognitionException e) {
        // fail fast on any lexer error instead of letting the lexer recover
        throw new RuntimeException(e);
    }
});
// create a buffer of tokens pulled from the lexer
TokenStream tokens = new CommonTokenStream(lexer);
// create a parser that feeds off the tokens buffer
ASNParser parser = new ASNParser(tokens);
parser.moduleDefinition(); // begin parsing at moduleDefinition rule
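// JUnit assertion (static import assumed); unlike the assert keyword, it runs even without -ea
assertEquals(0, parser.getNumberOfSyntaxErrors());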

Could be that the lexer or parser recovers from it, could be something else. Impossible to say without seeing a "Minimal, Complete, and Verifiable example" (see: stackoverflow.com/help/mcve) — Bart Kiers
I'll put my stuff online. By your answer, I gather this is not normal ? — YaFred
"I gather this is not normal ?" - no, it's most likely ANTLR performs as expected. — Bart Kiers
"I'll put my stuff online" - no need to post hundreds of LOC, just enough to reproduce the problem. And please add the code to your question, not some off-site location. — Bart Kiers

Bart Kiers · Accepted Answer · 2018-02-26T17:01:15

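The lexer recovers from the unexpected input: it reports the `_` it cannot match, discards it, and carries on, so the parser never sees the underscore. You can see this by running this class: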

import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;

public class Main {

  public static void main(String[] args) {

    String source = "My-first-module_DEFINITIONS_::= \n" +
        "BEGIN \n" +
        "\n" +
        "END";

    ASNLexer lexer = new ASNLexer(CharStreams.fromString(source));
    ASNParser parser = new ASNParser(new CommonTokenStream(lexer));
    parser.moduleDefinition();
  }
}

which will print the following to stderr:

line 1:15 token recognition error at: '_'
line 1:27 token recognition error at: '_'
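
To make the "report, drop, continue" behaviour concrete, here is a small hand-written lexer in plain Java that mimics it for the reduced grammar. This is only a sketch and not ANTLR's actual implementation: the names `RecoveringLexerSketch`, `Result`, and `lex` are made up for illustration, and keywords are emitted as plain text rather than typed tokens.

```java
import java.util.ArrayList;
import java.util.List;

// Toy lexer for illustration only; it is NOT ANTLR's implementation.
// All names here (RecoveringLexerSketch, Result, lex) are hypothetical.
class RecoveringLexerSketch {

    // What the parser would receive (tokens) versus what was only reported (errors).
    static final class Result {
        final List<String> tokens = new ArrayList<>();
        final List<String> errors = new ArrayList<>();
    }

    static boolean isAlnum(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
    }

    static Result lex(String input) {
        Result result = new Result();
        int line = 1;       // 1-based line number
        int lineStart = 0;  // index of the first character of the current line
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == ' ' || c == '\t' || c == '\r' || c == '\n') {
                // Whitespace: skipped silently, like "WS -> skip".
                if (c == '\n') {
                    line++;
                    lineStart = i + 1;
                }
                i++;
            } else if (input.startsWith("::=", i)) {
                result.tokens.add("::=");
                i += 3;
            } else if (c >= 'A' && c <= 'Z') {
                // Same shape as UCASE_ID: [A-Z] ( '-'? [a-zA-Z0-9] )*
                int start = i++;
                while (i < input.length()) {
                    if (isAlnum(input.charAt(i))) {
                        i++;
                    } else if (input.charAt(i) == '-' && i + 1 < input.length()
                            && isAlnum(input.charAt(i + 1))) {
                        i += 2;
                    } else {
                        break;
                    }
                }
                result.tokens.add(input.substring(start, i));
            } else {
                // No rule matches: report the character, drop it, keep going.
                result.errors.add("line " + line + ":" + (i - lineStart)
                        + " token recognition error at: '" + c + "'");
                i++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Result r = lex("My-first-module_DEFINITIONS_::= \nBEGIN \n\nEND");
        r.errors.forEach(System.err::println); // the two underscore errors
        System.out.println(r.tokens);          // what the parser gets to see
    }
}
```

For the input from the question, the token list contains only `My-first-module`, `DEFINITIONS`, `::=`, `BEGIN` and `END`. The underscores show up in the error list but never in the token stream, which is why a check on the parser alone passes.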

There are a couple of options here:

1. add a catch-all rule

Add a rule like this at the very end of your grammar, after all other lexer rules, so that it only matches characters no other rule accepts:

Other
 : .
 ;

and then handle Other in your parser as you see fit.

2. add custom `ErrorListener`

Do something like this:

lexer.addErrorListener(new BaseErrorListener(){
  @Override
  public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line, int charPositionInLine, String msg, RecognitionException e) {
    throw new RuntimeException(e);
  }
});

that will cause any errors in the lexer to throw a RuntimeException.

Note that ANTLR4 supports a more compact notation for defining character sets, like this:

UCASE_ID:
    [A-Z] ( '-'? [a-zA-Z0-9] )*
;

WS:
    [ \t\r\n] -> skip
;
