I am trying to match the text below with an ANTLR grammar:
# a
# b
# c
The ANTLR grammar is:
grammar header;
start : commentBlock EOF;
commentBlock : CommentLine+;
CommentLine : '#' AsciiChars+;
AsciiChars : [a-zA-Z];
fragment CR : '\r';
fragment LF : '\n';
EOL : CR?LF ->skip;
fragment Tab : '\t';
fragment Space : ' ';
TS : (Tab|Space)+ ->skip;
The error I got is:
line 1:0 token recognition error at: '# '
line 2:0 token recognition error at: '# '
line 3:0 token recognition error at: '# '
[@0,2:2='a',<AsciiChars>,1:2]
[@1,7:7='b',<AsciiChars>,2:2]
[@2,12:12='c',<AsciiChars>,3:2]
[@3,15:14='<EOF>',<EOF>,4:0]
line 1:2 mismatched input 'a' expecting CommentLine
The grammar looks reasonable to me, so why does this error occur?
ADD 1
Strangely, after I changed the lexer rule CommentLine into a parser rule commentLine, it works:
grammar header;
start : commentBlock EOF;
commentBlock : commentLine+;
commentLine : '#' AsciiChars+; // <=== here CommentLine -> commentLine
AsciiChars : [a-zA-Z];
fragment CR : '\r';
fragment LF : '\n';
EOL : CR?LF ->skip;
fragment Tab : '\t';
fragment Space : ' ';
TS : (Tab|Space)+ ->skip;
But I actually want to discard all the comment lines. If it has to be a parser rule, I cannot use ->skip to discard them.
ADD 2
I think I can explain it now.
The critical things to remember are:
- The lexer phase happens before the parser phase.
- If a skipped token T1 is referenced by another lexer rule, say T2, the T1 part within T2 is not skipped.
Let me explain with a concise example.
The document to match:
# abc
Grammar 1:
grammar test;
t : T2;
p : t EOF;
Char : [a-z];
T2 : '#' T1+ Char+; // <<<< Here T2 references the to-be-skipped T1.
fragment Tab : '\t';
fragment Space : ' ';
T1 : (Tab|Space)+ ->skip; //<<<<< T1 is to be skipped.
In grammar 1, T1 is skipped, but the T1 part inside T2 is not. T2 matches the whole input text in the lexer phase. (Even if we put T2 after T1, T2 still matches, because the ANTLR lexer prefers the rule that matches the longest input and uses rule order only to break ties.)
Grammar 2:
Here the skipped T1 is referenced not by another lexer rule but directly by a parser rule.
grammar test;
t : '#' T1+ Char+; // <<<<<<<<<<<< HERE
p : t EOF;
Char : [a-z];
fragment Tab : '\t';
fragment Space : ' ';
T1 : (Tab|Space)+ ->skip; //<<<<< T1 is to be skipped.
This time there is no T2 rule to help the spaces survive the lexer phase, so every T1 in the input is skipped. The parser phase that follows then fails with this error:
[@0,0:0='#',<'#'>,1:0]
[@1,4:4='a',<Char>,1:4]
[@2,5:5='b',<Char>,1:5]
[@3,6:6='c',<Char>,1:6]
[@4,7:6='<EOF>',<EOF>,1:7]
line 1:4 mismatched input 'a' expecting T1
This happens because every T1 token was already discarded in the lexer phase.
ADD 3
Back to my original question: the subtle mistake I made was thinking that, after TS is skipped, the remaining characters could be regrouped into a new CommentLine token, which contains no spaces. That is plain wrong in ANTLR.
Because the whole lexer phase happens before the parser phase, CommentLine is matched against the raw input characters, spaces included. As written it allows no spaces, so it never matches a line like "# a".
So, just as @macmoonshine said, I do have to add TS into the CommentLine token.
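For reference, here is a minimal sketch of what that fix could look like for the sample input, with the comment lines discarded in the lexer. The fragment names AsciiChar and Blank are my own, and reducing start to just EOF is an assumption that only holds because nothing but comments appears in the input.

```antlr
grammar header;

// Every comment line is skipped by the lexer, so for the sample input
// the parser only ever sees EOF.
start : EOF;

// '#', optional blanks, then letters. The blanks are matched here as raw
// characters; the lexer cannot rely on TS having removed them already.
CommentLine : '#' Blank* AsciiChar+ -> skip;

fragment AsciiChar : [a-zA-Z];
fragment Blank     : [ \t];

EOL : '\r'? '\n' -> skip;
TS  : Blank+ -> skip;
```

Note that this still only accepts a single run of letters after the '#'. If comments may contain several words or punctuation, a common alternative is to skip everything up to the end of the line with CommentLine : '#' ~[\r\n]* -> skip;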