ANTLR4 token image concatenation with comments in the mix

Question

I'm trying to write an ANTLR4 lexer for some language. I've got a working one, but I'm not entirely satisfied with it.

keyword "my:little:uri" + /* my comment here */ ':it:is'
// nasty comment
+ ":mehmeh"; // single line comment

keyword + {}

This is an example of statements in the language. It's simply a bunch of keywords followed by string arguments and terminated by a semicolon or a block of sub-statements. Strings may be unquoted, single-quoted or double-quoted. The quoted strings may be concatenated as in the example above. An unquoted string containing a plus sign (+) is valid.

What I find problematic are the comments. I'd like to recognize whatever follows a keyword as a single string token, sans the comments (and whitespace). I'd usually use the `more` lexer command, but I don't think it's applicable to the example above. Is there a pattern that would allow me to achieve something like this?

My current lexer grammar:

lexer grammar test;

@members {
    public static final int CHANNEL_COMMENTS = 1;
}

WHITESPACE : (' ' | '\t' | '\n' | '\r' | '\f') -> skip;

SINGLE_LINE_COMMENT : '//' (~[\n\r])* ('\n' | '\r' | '\r\n')? -> channel(CHANNEL_COMMENTS);

MULTI_LINE_COMMENT : '/*' .*? '*/' -> channel(CHANNEL_COMMENTS);

KEYWORD :  'keyword' -> pushMode(IN_STRING_KEYWORD);

LBRACE : '{';
RBRACE : '}';
SEMICOLON : ';';

mode IN_STRING_KEYWORD;
STRING_WHITESPACE : WHITESPACE -> skip;
STRING_SINGLE_LINE_COMMENT : SINGLE_LINE_COMMENT -> type(SINGLE_LINE_COMMENT), channel(CHANNEL_COMMENTS);
STRING_MULTI_LINE_COMMENT : MULTI_LINE_COMMENT -> type(MULTI_LINE_COMMENT), channel(CHANNEL_COMMENTS);
STRING_LBRACE : LBRACE -> type(LBRACE), popMode;
STRING_SEMICOLON : SEMICOLON -> type(SEMICOLON), popMode;
STRING : ((QUOTED_STRING ('+' QUOTED_STRING)*) | UNQUOTED_STRING);
fragment QUOTED_STRING : (SINGLEQUOTED_STRING | DOUBLEQUOTED_STRING);
fragment UNQUOTED_STRING : (~[ \t;{}/*'"\n\r] | '/' ~[/*] | '*' ~['/'])+;
fragment SINGLEQUOTED_STRING : '\'' (~['])* '\'';
fragment DOUBLEQUOTED_STRING : 
    '"'
      (
        (~["\\]) |
        ('\\' [nt"\\])
      )* 
    '"'
;

Am I perhaps trying to do too much inside the lexer and should just feed what I currently have to the parser and let it handle the above mess?

Edit01

Thanks to 280Z28, I decided to fix the above lexer grammar by getting rid of my STRING token and simply settling for QUOTED_STRING, UNQUOTED_STRING and the operator CONCAT. The rest will be handled in the parser. I also added an additional lexer mode in order to distinguish between CONCAT and UNQUOTED_STRING.

lexer grammar test;

@members {
    public static final int CHANNEL_COMMENTS = 2;
}

WHITESPACE : (' ' | '\t' | '\n' | '\r' | '\f') -> skip;
SINGLE_LINE_COMMENT : '//' (~[\n\r])*  -> channel(CHANNEL_COMMENTS);
MULTI_LINE_COMMENT : '/*' .*? '*/' -> channel(CHANNEL_COMMENTS);

KEYWORD :  'keyword' -> pushMode(IN_STRING_KEYWORD);

LBRACE : '{';
RBRACE : '}';
SEMICOLON : ';';

mode IN_STRING_KEYWORD;
STRING_WHITESPACE : WHITESPACE -> skip;
STRING_SINGLE_LINE_COMMENT : SINGLE_LINE_COMMENT -> type(SINGLE_LINE_COMMENT), channel(CHANNEL_COMMENTS);
STRING_MULTI_LINE_COMMENT : MULTI_LINE_COMMENT -> type(MULTI_LINE_COMMENT), channel(CHANNEL_COMMENTS);
STRING_LBRACE : LBRACE -> type(LBRACE), popMode;
STRING_SEMICOLON : SEMICOLON -> type(SEMICOLON), popMode;
QUOTED_STRING : (SINGLEQUOTED_STRING | DOUBLEQUOTED_STRING) -> mode(IN_QUOTED_STRING);
UNQUOTED_STRING : (~[ \t;{}/*'"\n\r] | '/' ~[/*] | '*' ~[/])+;
fragment SINGLEQUOTED_STRING : '\'' (~['])* '\'';
fragment DOUBLEQUOTED_STRING : 
    '"'
      (
        (~["\\]) |
        ('\\' [nt"\\])
      )* 
    '"'
;

mode IN_QUOTED_STRING;
QUOTED_STRING_WHITESPACE : WHITESPACE -> skip;
QUOTED_STRING_SINGLE_LINE_COMMENT : SINGLE_LINE_COMMENT -> type(SINGLE_LINE_COMMENT), channel(CHANNEL_COMMENTS);
QUOTED_STRING_MULTI_LINE_COMMENT : MULTI_LINE_COMMENT -> type(MULTI_LINE_COMMENT), channel(CHANNEL_COMMENTS);
QUOTED_STRING_LBRACE : LBRACE -> type(LBRACE), popMode;
QUOTED_STRING_SEMICOLON : SEMICOLON -> type(SEMICOLON), popMode;
QUOTED_STRING2 : QUOTED_STRING -> type(QUOTED_STRING);
CONCAT : '+';

You should include the exact semantics of each type of string in your question (especially unquoted strings). — Sam Harwell
@280Z28, this can be seen from my grammar. Or did you mean in human readable form? — predi
The problem is, if your grammar was working correctly you wouldn't need to ask the question. Including a separate description helps clarify what you are trying to do so I can compare it to what you actually did. :) — Sam Harwell

Sam Harwell · Accepted Answer · 2013-05-13T13:39:12

Don't perform string concatenation in the lexer. Send the + operator to the parser as an operator. This will make it much easier to eliminate the whitespace and/or comments appearing between strings and the operator.
```
CONCAT : '+';
STRING : QUOTED_STRING | UNQUOTED_STRING;
```
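Once the parser sees the strings as separate tokens, for example through a rule shaped like `QUOTED_STRING (CONCAT QUOTED_STRING)*`, the concatenation itself can happen in ordinary target-language code. Below is a minimal Java sketch of that step. The class and method names (`StringArgument`, `unquote`, `concat`) are hypothetical, and it assumes the escape sequences `\n`, `\t`, `\"` and `\\` in double-quoted strings should be decoded to the characters they name, which the grammar does not state. Single-quoted strings are taken literally because the grammar defines no escapes for them.

```java
import java.util.List;

/** Joins the text of quoted-string tokens after the parser has matched them. */
final class StringArgument {

    /** Strips the surrounding quotes from one token image and decodes escapes. */
    static String unquote(String image) {
        if (image.length() < 2) {
            throw new IllegalArgumentException("not a quoted string: " + image);
        }
        char quote = image.charAt(0);
        boolean closed = image.charAt(image.length() - 1) == quote;
        if ((quote != '\'' && quote != '"') || !closed) {
            throw new IllegalArgumentException("not a quoted string: " + image);
        }
        String body = image.substring(1, image.length() - 1);
        if (quote == '\'') {
            return body; // single-quoted strings have no escape sequences
        }
        StringBuilder out = new StringBuilder(body.length());
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c == '\\' && i + 1 < body.length()) {
                char next = body.charAt(++i);
                switch (next) {
                    case 'n': out.append('\n'); break; // assumed meaning: newline
                    case 't': out.append('\t'); break; // assumed meaning: tab
                    default:  out.append(next);        // \" and \\ keep the escaped char
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Concatenates the decoded contents of the given token images, in order. */
    static String concat(List<String> images) {
        StringBuilder out = new StringBuilder();
        for (String image : images) {
            out.append(unquote(image));
        }
        return out.toString();
    }
}
```

In a listener or visitor you would collect the text of each quoted-string token in the argument and pass the list to `concat`. Because comments and whitespace never reach the parser on the default channel, they need no special handling here.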
You should be aware that ANTLR 4 changed the predefined HIDDEN channel from 99 to 1, so HIDDEN and CHANNEL_COMMENTS are the same in your grammar.

Don't include the line terminator at the end of the SINGLE_LINE_COMMENT rule.

SINGLE_LINE_COMMENT
    :   '//' (~[\n\r])*
        -> channel(CHANNEL_COMMENTS)
    ;

Your UNQUOTED_STRING token currently contains the set ['/']. If you meant to exclude ' characters, the second ' in the set is redundant so you can use ['/]. If you only meant to exclude /, then you can use either the syntax [/] or '/'.
