ANTLR grammar to match comment lines starting with #

Question

I am trying to match the text below with an ANTLR grammar:

# a
# b
# c

The ANTLR grammar is:

grammar header;


start : commentBlock
        EOF;

commentBlock : CommentLine+;
CommentLine  : '#' AsciiChars+;
AsciiChars : [a-zA-Z];

fragment CR : '\r';
fragment LF : '\n';
EOL : CR?LF ->skip;

fragment Tab : '\t';
fragment Space : ' ';
TS : (Tab|Space)+ ->skip;

The error I got is:

line 1:0 token recognition error at: '# '
line 2:0 token recognition error at: '# '
line 3:0 token recognition error at: '# '
[@0,2:2='a',<AsciiChars>,1:2]
[@1,7:7='b',<AsciiChars>,2:2]
[@2,12:12='c',<AsciiChars>,3:2]
[@3,15:14='<EOF>',<EOF>,4:0]
line 1:2 mismatched input 'a' expecting CommentLine

The grammar looks reasonable to me, so why does this error occur?

ADD 1

Strange, after I changed the lexer rule CommentLine into a parser rule commentLine, it works:

grammar header;

start : commentBlock
        EOF;

commentBlock : commentLine+;
commentLine  : '#' AsciiChars+; // <=== here CommentLine -> commentLine
AsciiChars : [a-zA-Z];

fragment CR : '\r';
fragment LF : '\n';
EOL : CR?LF ->skip;

fragment Tab : '\t';
fragment Space : ' ';
TS : (Tab|Space)+ ->skip;

But actually I want to discard all the comment lines. If it has to be a parser rule, I cannot use ->skip to discard it.

ADD 2

I think I can explain it now.

The critical things to remember are:

The lexer phase happens before the parser phase.
If a skipped token T1 is referenced by another lexer rule, say T2, the T1 part inside T2 is not skipped.

Let me explain it with a concise example:

The document to match:

#   abc

Grammar 1:

grammar test;

t : T2;
p : t
    EOF;

Char : [a-z];

T2 : '#' T1+ Char+; // <<<< Here T2 reference the so-skipped T1.

fragment Tab : '\t';
fragment Space : ' ';
T1 : (Tab|Space)+ ->skip; //<<<<< T1 is to be skipped.

In grammar 1, T1 on its own is skipped, but the T1 part inside T2 is not, so T2 matches the whole input text in the lexer phase. (Even if we put T2 after T1, T2 still matches, because the ANTLR lexer picks the rule that matches the longest run of input.)
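One way to make this behaviour less surprising is to keep the shared whitespace pattern in a fragment, so that no token rule references a skipped token. The sketch below is a variant of Grammar 1 under that assumption; the names WS and Blank are made up for illustration and are not part of the original grammar.

```antlr
grammar test;

p : t EOF;
t : T2;

// T2 carries its own whitespace: the fragment WS is matched as part of T2.
T2 : '#' WS+ Char+;

Char : [a-z];

// Whitespace that is not part of a T2 token is still discarded.
Blank : WS+ -> skip;

// Fragments never become tokens themselves; they only build other lexer rules.
fragment WS : [ \t];
```

Here the skip command sits only on Blank, so it is clear which token is discarded and which one keeps its spaces.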

Grammar 2:

The skipped T1 is not referenced by another token rule, but directly in a parser rule.

grammar test;

t : '#' T1+ Char+; // <<<<<<<<<<<< HERE
p : t
    EOF;

Char : [a-z];

fragment Tab : '\t';
fragment Space : ' ';
T1 : (Tab|Space)+ ->skip; //<<<<< T1 is to be skipped.

This time there is no T2 rule to help the spaces survive the lexer phase, so every T1 in the input is skipped. In the parser phase afterwards, matching fails with this error:

[@0,0:0='#',<'#'>,1:0]
[@1,4:4='a',<Char>,1:4]
[@2,5:5='b',<Char>,1:5]
[@3,6:6='c',<Char>,1:6]
[@4,7:6='<EOF>',<EOF>,1:7]
line 1:4 mismatched input 'a' expecting T1

That is because every T1 token has already been discarded in the lexer phase.

ADD 3

Back to my original question: the subtle mistake I made was assuming that, once TS is skipped, the remaining characters could be regrouped into a new CommentLine token that contains no spaces. That is plain wrong in ANTLR.

The whole lexer phase happens before the parser phase. CommentLine is a token rule with no spaces in it, so it matches nothing in the input, where every comment line does contain spaces.

So, just as @macmoonshine said, I do have to add TS to the CommentLine token.
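Since the goal is to discard the comment lines entirely, one common ANTLR 4 idiom is a single skipped lexer rule that consumes everything from '#' up to the end of the line. The following is a minimal sketch rather than the exact rule suggested in the answer; it assumes that anything after '#' (not only letters, spaces and tabs) belongs to the comment, and that the parser should see nothing but EOF for the sample input.

```antlr
grammar header;

// Every comment line is skipped by the lexer, so only EOF reaches the parser.
start : EOF;

// '#' followed by any characters except CR and LF; spaces and tabs are included.
CommentLine : '#' ~[\r\n]* -> skip;

fragment CR : '\r';
fragment LF : '\n';
EOL : CR? LF -> skip;

fragment Tab : '\t';
fragment Space : ' ';
TS : (Tab | Space)+ -> skip;
```

If the comments should stay available to later tooling instead of vanishing, `-> channel(HIDDEN)` can replace `-> skip`; the parser ignores tokens on the hidden channel as well.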

Surely there's an "Accept" in there somewhere -- what else can I help with to get an Accept ;) — TomServo

clemens · Accepted Answer · 2017-07-30T09:37:37

Your grammar does not allow spaces in comments, but your comments contain them.

EDIT: Have you tried commentLine : '#' TS AsciiChars; as comment rule?
