1
votes

I'm trying to write an ANTLR4 lexer for some language. I've got a working one, but I'm not entirely satisfied with it.

keyword "my:little:uri" + /* my comment here */ ':it:is'
// nasty comment
+ ":mehmeh"; // single line comment

keyword + {}

This is an example of statements in the language. It's simply a bunch of keywords followed by string arguments and terminated by a semicolon or a block of sub-statements. Strings may be unquoted, single-quoted or double-quoted. The quoted strings may be concatenated as in the example above. An unquoted string containing a plus sign (+) is valid.

What I find problematic are the comments. I'd like to recognize whatever follows a keyword as a single string token, sans the comments (and whitespace). I'd usually use the more lexer command but I don't think it's applicable for the example above. Is there a pattern that would allow me achieve something like this?

My current lexer grammar:

lexer grammar test;

@members {
    public static final int CHANNEL_COMMENTS = 1;
}

WHITESPACE : (' ' | '\t' | '\n' | '\r' | '\f') -> skip;

SINGLE_LINE_COMMENT : '//' (~[\n\r])* ('\n' | '\r' | '\r\n')? -> channel(CHANNEL_COMMENTS);

MULTI_LINE_COMMENT : '/*' .*? '*/' -> channel(CHANNEL_COMMENTS);

KEYWORD :  'keyword' -> pushMode(IN_STRING_KEYWORD);

LBRACE : '{';
RBRACE : '}';
SEMICOLON : ';';

mode IN_STRING_KEYWORD;
STRING_WHITESPACE : WHITESPACE -> skip;
STRING_SINGLE_LINE_COMMENT : SINGLE_LINE_COMMENT -> type(SINGLE_LINE_COMMENT), channel(CHANNEL_COMMENTS);
STRING_MULTI_LINE_COMMENT : MULTI_LINE_COMMENT -> type(MULTI_LINE_COMMENT), channel(CHANNEL_COMMENTS);
STRING_LBRACE : LBRACE -> type(LBRACE), popMode;
STRING_SEMICOLON : SEMICOLON -> type(SEMICOLON), popMode;
STRING : ((QUOTED_STRING ('+' QUOTED_STRING)*) | UNQUOTED_STRING);
fragment QUOTED_STRING : (SINGLEQUOTED_STRING | DOUBLEQUOTED_STRING);
fragment UNQUOTED_STRING : (~[ \t;{}/*'"\n\r] | '/' ~[/*] | '*' ~['/'])+;
fragment SINGLEQUOTED_STRING : '\'' (~['])* '\'';
fragment DOUBLEQUOTED_STRING : 
    '"'
      (
        (~["\\]) |
        ('\\' [nt"\\])
      )* 
    '"'
;

Am I perhaps trying to do too much inside the lexer and should just feed what I currently have to the parser and let it handle the above mess?

Edit01

Thanks to 280Z28, I decided to fix the above lexer grammar by getting rid of my STRING token and simply settling for QUOTED_STRING, UNQUOTED_STRING and the operator CONCAT. The rest will be handled in the parser. I also added an additional lexer mode in order to distinguish between CONCAT and UNQUOTED_STRING.

lexer grammar test;

@members {
    public static final int CHANNEL_COMMENTS = 2;
}

WHITESPACE : (' ' | '\t' | '\n' | '\r' | '\f') -> skip;
SINGLE_LINE_COMMENT : '//' (~[\n\r])*  -> channel(CHANNEL_COMMENTS);
MULTI_LINE_COMMENT : '/*' .*? '*/' -> channel(CHANNEL_COMMENTS);

KEYWORD :  'keyword' -> pushMode(IN_STRING_KEYWORD);

LBRACE : '{';
RBRACE : '}';
SEMICOLON : ';';

mode IN_STRING_KEYWORD;
STRING_WHITESPACE : WHITESPACE -> skip;
STRING_SINGLE_LINE_COMMENT : SINGLE_LINE_COMMENT -> type(SINGLE_LINE_COMMENT), channel(CHANNEL_COMMENTS);
STRING_MULTI_LINE_COMMENT : MULTI_LINE_COMMENT -> type(MULTI_LINE_COMMENT), channel(CHANNEL_COMMENTS);
STRING_LBRACE : LBRACE -> type(LBRACE), popMode;
STRING_SEMICOLON : SEMICOLON -> type(SEMICOLON), popMode;
QUOTED_STRING : (SINGLEQUOTED_STRING | DOUBLEQUOTED_STRING) -> mode(IN_QUOTED_STRING);
UNQUOTED_STRING : (~[ \t;{}/*'"\n\r] | '/' ~[/*] | '*' ~[/])+;
fragment SINGLEQUOTED_STRING : '\'' (~['])* '\'';
fragment DOUBLEQUOTED_STRING : 
    '"'
      (
        (~["\\]) |
        ('\\' [nt"\\])
      )* 
    '"'
;

mode IN_QUOTED_STRING;
QUOTED_STRING_WHITESPACE : WHITESPACE -> skip;
QUOTED_STRING_SINGLE_LINE_COMMENT : SINGLE_LINE_COMMENT -> type(SINGLE_LINE_COMMENT), channel(CHANNEL_COMMENTS);
QUOTED_STRING_MULTI_LINE_COMMENT : MULTI_LINE_COMMENT -> type(MULTI_LINE_COMMENT), channel(CHANNEL_COMMENTS);
QUOTED_STRING_LBRACE : LBRACE -> type(LBRACE), popMode;
QUOTED_STRING_SEMICOLON : SEMICOLON -> type(SEMICOLON), popMode;
QUOTED_STRING2 : QUOTED_STRING -> type(QUOTED_STRING);
CONCAT : '+';
1
You should include the exact semantics of each type of string in your question (especially unquoted strings).Sam Harwell
@280Z28, this can be seen from my grammar. Or did you mean in human readable form?predi
The problem is, if your grammar was working correctly you wouldn't need to ask the question. Including a separate description helps clarify what you are trying to do so I can compare it to what you actually did. :)Sam Harwell

1 Answers

1
votes
  • Don't perform string concatenation in the lexer. Send the + operator to the parser as an operator. This will make it much easier to eliminate the whitespace and/or comments appearing between strings and the operator.

    CONCAT : '+';
    STRING : QUOTED_STRING | UNQUOTED_STRING;
    
  • You should be aware that ANTLR 4 changed the predefined HIDDEN channel from 99 to 1, so HIDDEN and CHANNEL_COMMENTS are the same in your grammar.

  • Don't include the line terminator at the end of the SINGLE_LINE_COMMENT rule.

    SINGLE_LINE_COMMENT
        :   '//' (~[\n\r])*
            -> channel(CHANNEL_COMMENTS)
        ;
    
  • Your UNQUOTED_STRING token currently contains the set ['/']. If you meant to exclude ' characters, the second ' in the set is redundant so you can use ['/]. If you only meant to exclude /, then you can use either the syntax [/] or '/'.