1
votes

I am trying to match the text below with an ANTLR grammar:

# a
# b
# c

The ANTLR grammar is:

grammar header;


start : commentBlock
        EOF;

commentBlock : CommentLine+;
CommentLine  : '#' AsciiChars+;
AsciiChars : [a-zA-Z];

fragment CR : '\r';
fragment LF : '\n';
EOL : CR?LF ->skip;

fragment Tab : '\t';
fragment Space : ' ';
TS : (Tab|Space)+ ->skip;

The error I got is:

line 1:0 token recognition error at: '# '
line 2:0 token recognition error at: '# '
line 3:0 token recognition error at: '# '
[@0,2:2='a',<AsciiChars>,1:2]
[@1,7:7='b',<AsciiChars>,2:2]
[@2,12:12='c',<AsciiChars>,3:2]
[@3,15:14='<EOF>',<EOF>,4:0]
line 1:2 mismatched input 'a' expecting CommentLine

The grammar looks reasonable to me, so why does this error occur?

ADD 1

Strange, after I changed the lexer rule CommentLine into a parser rule commentLine, it works:

grammar header;

start : commentBlock
        EOF;

commentBlock : commentLine+;
commentLine  : '#' AsciiChars+; // <=== here CommentLine -> commentLine
AsciiChars : [a-zA-Z];

fragment CR : '\r';
fragment LF : '\n';
EOL : CR?LF ->skip;

fragment Tab : '\t';
fragment Space : ' ';
TS : (Tab|Space)+ ->skip; 

But actually I want to discard all the comment lines. If it has to be a parser rule, I cannot use ->skip to discard it.

ADD 2

I think I can explain it now.

The critical things to remember are:

  • The lexer phase happens before the parser phase.
  • If a skipped token T1 is referenced by another lexer rule, say T2, the T1 part within T2 is not skipped.

Let me explain it with a concise example:

The document to match:

#   abc

Grammar 1:

grammar test;

t : T2;
p : t
    EOF;

Char : [a-z];

T2 : '#' T1+ Char+; // <<<< Here T2 references the to-be-skipped T1.

fragment Tab : '\t';
fragment Space : ' ';
T1 : (Tab|Space)+ ->skip; //<<<<< T1 is to be skipped.

In grammar 1, T1 on its own is skipped, but the T1 part inside T2 is not: T2 matches the whole input text in the lexer phase. (Even if we put T2 after T1, T2 still matches, because the ANTLR lexer picks the rule that matches the longest input and uses rule order only to break ties.)
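
The longest-match behaviour can be sketched in a few lines of Python. This is only a rough model, not how ANTLR is implemented: the RULES table and the tokenize helper are hypothetical names, and each lexer rule of Grammar 1 is approximated by a regular expression with T1 inlined into T2.

```python
import re

# Hypothetical regex stand-ins for the lexer rules of Grammar 1, in grammar order.
# Each entry is (rule name, pattern, skipped?).
RULES = [
    ("Char", r"[a-z]", False),
    ("T2", r"#[\t ]+[a-z]+", False),  # '#' T1+ Char+ with T1 inlined
    ("T1", r"[\t ]+", True),          # -> skip
]

def tokenize(text, rules=RULES):
    """Longest match wins; on a tie, the rule listed first wins."""
    pos, tokens = 0, []
    while pos < len(text):
        best = None
        for name, pattern, skipped in rules:
            m = re.compile(pattern).match(text, pos)
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (name, m.end(), skipped)
        if best is None:
            raise ValueError(f"token recognition error at index {pos}")
        name, end, skipped = best
        if not skipped:  # skipped tokens never reach the parser
            tokens.append((name, text[pos:end]))
        pos = end
    return tokens

print(tokenize("#   abc"))                         # [('T2', '#   abc')]
print(tokenize("#   abc", list(reversed(RULES))))  # same result, rule order is irrelevant here
```

The spaces survive only because they are consumed as part of the longer T2 match; a run of spaces that is matched by T1 alone is dropped.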

Grammar 2:

The skipped T1 is not referenced by another token rule, but directly in a parser rule.

grammar test;

t : '#' T1+ Char+; // <<<<<<<<<<<< HERE
p : t
    EOF;

Char : [a-z];

fragment Tab : '\t';
fragment Space : ' ';
T1 : (Tab|Space)+ ->skip; //<<<<< T1 is to be skipped.

This time there is no T2 rule to help the spaces survive the lexer phase, so every T1 in the input is skipped. In the parser phase afterwards, matching fails with this error:

[@0,0:0='#',<'#'>,1:0]
[@1,4:4='a',<Char>,1:4]
[@2,5:5='b',<Char>,1:5]
[@3,6:6='c',<Char>,1:6]
[@4,7:6='<EOF>',<EOF>,1:7]
line 1:4 mismatched input 'a' expecting T1

This is because every T1 has already been discarded in the lexer phase.

ADD 3

Back to my original question: the subtle mistake I made was to think that, after TS is skipped, the remaining characters could be regrouped into a new CommentLine token that has no spaces. This is plain wrong with ANTLR.

Because the lexer phase happens entirely before the parser phase, and CommentLine is a token rule with no spaces in it, it won't match any of the comment lines in the input.

So just as @macmoonshine said, I do have to add TS into the CommentLine token.
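
As a quick sanity check of that conclusion, the two versions of the CommentLine rule can be approximated with Python regular expressions. ORIGINAL and WITH_TS are hypothetical names, and a regex is only a rough stand-in for an ANTLR lexer rule; the point is that the lexer sees the raw characters, space included.

```python
import re

# Regex stand-ins for the two versions of the CommentLine lexer rule.
ORIGINAL = re.compile(r"#[a-zA-Z]+")        # '#' AsciiChars+
WITH_TS = re.compile(r"#[\t ]+[a-zA-Z]+")   # '#' TS AsciiChars+

line = "# a"
print(bool(ORIGINAL.fullmatch(line)))  # False: the space is still in the raw input
print(bool(WITH_TS.fullmatch(line)))   # True: the rule now accounts for the space
print(bool(ORIGINAL.fullmatch("#a")))  # True: the original rule only fits comments without a space
```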

3
Be sure to check out the many options I suggest below. - TomServo
Surely there's an "Accept" in there somewhere -- what else can I help with to get an Accept ;) - TomServo

3 Answers

1
votes

Your grammar does not include spaces in comments, but your comments do.

EDIT: Have you tried commentLine : '#' TS AsciiChars; as the comment rule?

1
votes

Perhaps you're looking for:

grammar Header;

start : CommentLine+ EOF;

CommentLine  : '#' ' ' AsciiChars+;
AsciiChars : [a-zA-Z];

fragment CR : '\r';
fragment LF : '\n';
EOL : CR?LF ->skip;

fragment Tab : '\t';
fragment Space : ' ';
TS : (Tab|Space)+ ->skip; 

Now this one uses just a lexer rule.

TO IGNORE COMMENTS

grammar Header;

start : CommentLine+ EOF;

CommentLine  : '#' ' ' AsciiChars+ -> skip;
AsciiChars : [a-zA-Z];

fragment CR : '\r';
fragment LF : '\n';
EOL : CR?LF ->skip;

fragment Tab : '\t';
fragment Space : ' ';
TS : (Tab|Space)+ ->skip;

As written, this skips the comments entirely, but it also gives an error, because the rule start expects a CommentLine that the lexer has already discarded. So if you want to ignore and discard comments, use something like this second grammar but don't mention CommentLine in your parser rules; just let the lexer strip them. If you want to preserve comments, use the previous one.

TO REROUTE COMMENTS

A final idea is to reroute comments to another channel:

grammar Header;

start : other EOF;
other: AsciiChars;
CommentLine  : '#' ' ' AsciiChars+ -> channel(2);
AsciiChars : [a-zA-Z]+;

fragment CR : '\r';
fragment LF : '\n';
EOL : CR?LF ->skip;

fragment Tab : '\t';
fragment Space : ' ';
TS : (Tab|Space)+ ->skip;

In this grammar, comments are still lexed, but routed to another channel for possible processing. I also added another rule to start just so there'd be something to match. Input, followed by the resulting tokens:

# a
# b
something
# c

[@0,0:2='# a',<CommentLine>,channel=2,1:0]
[@1,5:7='# b',<CommentLine>,channel=2,2:0]
[@2,10:18='something',<AsciiChars>,3:0]
[@3,21:23='# c',<CommentLine>,channel=2,4:0]
[@4,26:25='<EOF>',<EOF>,5:0]
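
To illustrate what routing to a channel means for the parser, here is a small Python sketch built on the token dump above. The Token class and on_channel helper are hypothetical, not part of any ANTLR API; the sketch assumes the parser reads only the default channel (0), which is what the parser's token stream does by default.

```python
from typing import NamedTuple

class Token(NamedTuple):
    text: str
    type: str
    channel: int = 0  # 0 is the default channel the parser reads from

# The tokens from the dump above, with their channels.
tokens = [
    Token("# a", "CommentLine", 2),
    Token("# b", "CommentLine", 2),
    Token("something", "AsciiChars"),
    Token("# c", "CommentLine", 2),
    Token("<EOF>", "EOF"),
]

def on_channel(tokens, channel):
    """Return the text of every token on the given channel."""
    return [t.text for t in tokens if t.channel == channel]

print(on_channel(tokens, 0))  # ['something', '<EOF>']
print(on_channel(tokens, 2))  # ['# a', '# b', '# c']
```

Unlike skipped tokens, the comments are still available to any later pass that asks for channel 2.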

One of these options should surely do it for you ;)

0
votes

Try this. Your comment looks like a normal single-line comment with '#' in place of '//'. If you require a space after the hash, use '# '. If you require the hash to be in column 1, use [\n\r] '# ' ~[\n\r]*. Judging from the example, this should cover all the potential cases.

COMMENT_LINE
    : '#' ~[\n\r]* -> skip
    ;