I am stuck on a pretty simple grammar. Googling and reading books did not help. I started using ANTLR quite recently, so this is probably a very simple question.
I am trying to write a very simple lexer using ANTLR v3:
    lexer grammar TestLexer;

    options {
        language = Java;
    }

    TEST_COMMENT
        : '/*' WS? TEST WS? '*/'
        ;

    ML_COMMENT
        : '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
        ;

    TEST
        : 'TEST'
        ;

    WS
        : (' ' | '\t' | '\n' | '\r' | '\f')+ {$channel=HIDDEN;}
        ;
The test class:
    import org.antlr.runtime.ANTLRStringStream;
    import org.antlr.runtime.Lexer;
    import org.antlr.runtime.RecognitionException;
    import org.antlr.runtime.Token;

    public class TestParserInvoker {

        // Prints every token the lexer emits. Tokens are pulled straight from
        // the lexer, so hidden-channel tokens such as WS are printed too.
        private static void extractCommandsTokens(final String script) throws RecognitionException {
            final ANTLRStringStream input = new ANTLRStringStream(script);
            final Lexer lexer = new TestLexer(input);
            Token t;
            do {
                t = lexer.nextToken();
                if (t != null) {
                    System.out.println(t);
                }
            } while (t == null || t.getType() != Token.EOF);
        }

        public static void main(final String[] args) throws RecognitionException {
            final String script = "/* TEST */";
            extractCommandsTokens(script);
        }
    }
When the test string is "/* TEST */", the lexer produces two tokens, as expected: one of type TEST_COMMENT and one EOF. Everything is OK.
But if the test string has one extra space at the end, "/* TEST */ ", the lexer produces three tokens: ML_COMMENT, WS and EOF.
Why does the first token get the ML_COMMENT type? I thought the way a token is detected depends only on the precedence of the lexer rules in the grammar, and that it certainly should not depend on the tokens that follow.
Thanks for your help!
P.S. I can use the lexer option filter=true, which gives the token the correct type, but that approach requires extra work in the token definitions. To be honest, I do not want to use this type of lexer.
Why use WS? inside another rule? With WS being in a hidden channel or skipped, it would never occur in another rule. – kay
WS only gets put on the HIDDEN channel if it's a token of its own. When part of another rule, the white space chars it matches are on the channel of that particular token. – Bart Kiers
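To make the channel point in the last comment concrete, here is a minimal hand-written sketch in plain Java. It is not ANTLR and does not reproduce how ANTLR v3 predicts tokens. It is only a toy matcher that tries the four rules in declaration order, which is the behaviour the question expected. The class name ChannelSketch, the Tok and Rule helpers, and the regular expressions are all assumptions made up for illustration; the channel numbers are likewise just illustrative constants.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy ordered-rule lexer (hypothetical, NOT generated by ANTLR).
final class ChannelSketch {
    static final int DEFAULT_CHANNEL = 0;
    static final int HIDDEN_CHANNEL = 99; // illustrative value only

    // A token carries the channel of the rule that produced it.
    static final class Tok {
        final String type;
        final String text;
        final int channel;

        Tok(String type, String text, int channel) {
            this.type = type;
            this.text = text;
            this.channel = channel;
        }

        @Override
        public String toString() {
            return type + " \"" + text + "\" channel=" + channel;
        }
    }

    static final class Rule {
        final String type;
        final Pattern pattern;
        final int channel;

        Rule(String type, String regex, int channel) {
            this.type = type;
            this.pattern = Pattern.compile(regex);
            this.channel = channel;
        }
    }

    // Rules are tried top to bottom; the first one that matches wins.
    static final Rule[] RULES = {
        // White space is matched inline here, so it becomes part of TEST_COMMENT.
        new Rule("TEST_COMMENT", "/\\*[ \\t\\n\\r\\f]*TEST[ \\t\\n\\r\\f]*\\*/", DEFAULT_CHANNEL),
        new Rule("ML_COMMENT", "(?s)/\\*.*?\\*/", HIDDEN_CHANNEL),
        new Rule("TEST", "TEST", DEFAULT_CHANNEL),
        // Only stand-alone white space ends up on the hidden channel.
        new Rule("WS", "[ \\t\\n\\r\\f]+", HIDDEN_CHANNEL),
    };

    static List<Tok> tokenize(String input) {
        List<Tok> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            Tok next = null;
            for (Rule r : RULES) {
                Matcher m = r.pattern.matcher(input);
                m.region(pos, input.length());
                if (m.lookingAt()) { // must match exactly at the current position
                    next = new Tok(r.type, m.group(), r.channel);
                    break;
                }
            }
            if (next == null) {
                throw new IllegalArgumentException("No rule matches at offset " + pos);
            }
            out.add(next);
            pos += next.text.length();
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints:
        //   TEST_COMMENT "/* TEST */" channel=0
        //   WS " " channel=99
        for (Tok t : tokenize("/* TEST */ ")) {
            System.out.println(t);
        }
    }
}
```

In this sketch the spaces inside the comment are part of the TEST_COMMENT text and share its channel, while the trailing space is a WS token of its own on the hidden channel. The fact that ANTLR v3 reports ML_COMMENT for the same input shows that its lexer does not simply try rules in order like this.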