ANTLR4 line comments and text parsing issue

Question

I'm writing a parser for a C++ header-style file and am having trouble handling line comments correctly.

CustomLexer.g4:

lexer grammar CustomLexer;

SPACES          : [ \r\n\t]+ -> skip;
COMMENT_START   : '//' -> pushMode(COMMENT_MODE);
PRAGMA          : '#pragma';
SECTION         : '@section';
DEFINE          : '#define';
UNDEF           : '#undef';
IF              : '#if';
ELIF            : '#elif';
ELSE            : '#else';
IFDEF           : '#ifdef';
IFNDEF          : '#ifndef';
ENDIF           : '#endif';
ENABLED         : 'ENABLED';
DISABLED        : 'DISABLED';
EITHER          : 'EITHER';
ANY             : 'ANY';
DEFINED         : 'defined';
BOTH            : 'BOTH';
BOOLEAN_LITERAL :  'true' | 'false';
STRING          : '"' .*? '"';
HEXADECIMAL     : '0x' ([a-fA-F0-9])+;
LITERAL_SUFFIX  : 'L'|'u'|'U'|'Lu'|'LU'|'uL'|'UL'|'f'|'F';
IDENTIFIER      : [a-zA-Z_] [a-zA-Z_0-9]*;
BLOCK_COMMENT   : '/**' .*? '*/';
NUMBER          : ('-')? Int ('.' Digit*)? | '0';
CHAR_SEQUENCE   : [a-zA-Z_] [a-zA-Z_0-9.]*;
ARRAY_SEQUENCE  : '{' .*?  '}';
OPAREN          : '(';
CPAREN          : ')';
OBRACE          : '{';
CBRACE          : '}';
ADD             : '+';
SUBTRACT        : '-';
MULTIPLY        : '*';
DIVIDE          : '/';
MODULUS         : '%';
OR              : '||';
AND             : '&&';
EQUALS          : '==';
NEQUALS         : '!=';
GTEQUALS        : '>=';
LTEQUALS        : '<=';
GT              : '>';
LT              : '<';
EXCL            : '!';
QMARK           : '?';
COLON           : ':';
COMA            : ',';
OTHER           : .;

fragment Int    : [0-9] Digit* | '0';
fragment Digit  : [0-9];

mode COMMENT_MODE;
  COMMENT_MODE_DEFINE     : '#define' -> type(DEFINE), popMode;
  COMMENT_MODE_SECTION    : '@section' -> type(SECTION), popMode;
  COMMENT_MODE_IF         : '#if' -> type(IF), popMode;
  COMMENT_MODE_ENDIF      : '#endif' -> type(ENDIF), popMode;
  COMMENT_MODE_LINE_BREAK : [\r\n]+ -> skip, popMode;
  
  COMMENT_MODE_PART       : ~[\r\n];

CustomParser.g4:

parser grammar CustomParser;

options { tokenVocab=CustomLexer; }

compilationUnit
 : statement* EOF
 ;

statement
 : comment? pragmaDirective
 | comment? defineDirective
 | comment? undefDirective
 | comment? ifDirective
 | comment? ifdefDirective
 | comment? ifndefDirective
 | sectionLineComment
 | comment
 ;

pragmaDirective
 :   PRAGMA char_sequence
 ;

subDirectives
 : ifDirective+
 | ifdefDirective+
 | ifndefDirective+
 | defineDirective+
 | undefDirective+
 | comment+
 ;

ifdefDirective
 : IFDEF IDENTIFIER subDirectives+ ENDIF
 ;

ifndefDirective
 : IFNDEF IDENTIFIER subDirectives+ ENDIF
 ;

ifDirective
 : ifStatement elseIfStatement* elseStatement? ENDIF
 ;

ifStatement
 : IF expression (subDirectives)*
 ;

elseIfStatement
 : ELIF expression (subDirectives)*
 ;

elseStatement
 : ELSE (subDirectives)*
 ;

defineDirective
 : BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER BOOLEAN_LITERAL info_comment?
 | BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER (char_sequence COMA?)+ info_comment?
 | BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER OPAREN? NUMBER LITERAL_SUFFIX? CPAREN? info_comment?
 | BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER HEXADECIMAL info_comment?
 | BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER STRING info_comment?
 | BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER OBRACE? (ARRAY_SEQUENCE COMA?)+ CBRACE? info_comment?
 | BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER expression info_comment?
 | BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER info_comment?
 ;

undefDirective
 : BLOCK_COMMENT? COMMENT_START? UNDEF IDENTIFIER info_comment?;

sectionLineComment
 : COMMENT_START COMMENT_MODE_PART? SECTION char_sequence
 ;

comment
 : BLOCK_COMMENT
 | line_comment+
 ;

expression
 : simpleExpression
 | customExpression
 | enabledExpression
 | disabledExpression
 | bothExpression
 | eitherExpression
 | anyExpression
 | definedExpression
 | comparisonExpression
 | arithmeticExpression
 ;

arithmeticExpression
 : arithmeticExpression  (MULTIPLY | DIVIDE) arithmeticExpression
 | arithmeticExpression (ADD | SUBTRACT) arithmeticExpression
 | OPAREN arithmeticExpression CPAREN
 | expressionIdentifier
 ;

comparisonExpression
 : comparisonExpression (EQUALS | NEQUALS | GTEQUALS | LTEQUALS | GT | LT) comparisonExpression
 | comparisonExpression (AND | OR) comparisonExpression
 | EXCL? OPAREN comparisonExpression CPAREN
 | eitherExpression
 | enabledExpression
 | bothExpression
 | anyExpression
 | definedExpression
 | disabledExpression
 | customExpression
 | simpleExpression
 | expressionIdentifier
 ;

enabledExpression : EXCL? OPAREN? ENABLED OPAREN IDENTIFIER CPAREN CPAREN?;
disabledExpression : EXCL? OPAREN? DISABLED OPAREN IDENTIFIER CPAREN CPAREN?;
bothExpression : EXCL? OPAREN? BOTH OPAREN identifiers identifiers CPAREN CPAREN?;
eitherExpression : EXCL? OPAREN? EITHER OPAREN identifiers+ CPAREN CPAREN?;
anyExpression : EXCL? OPAREN? ANY OPAREN identifiers+ CPAREN CPAREN?;
definedExpression : EXCL? OPAREN? DEFINED OPAREN IDENTIFIER CPAREN CPAREN?;
customExpression : EXCL? IDENTIFIER OPAREN IDENTIFIER CPAREN;
simpleExpression : EXCL? IDENTIFIER;
expressionIdentifier : IDENTIFIER | NUMBER;

identifiers
 : IDENTIFIER COMA?
 ;

line_comment
 : COMMENT_START COMMENT_MODE_PART*
 ;

info_comment
 : COMMENT_START COMMENT_MODE_PART*
 ;

char_sequence
 : CHAR_SEQUENCE
 | IDENTIFIER
 ;

It works fine with about 95% of the directives and comments in my header file, but a few scenarios are still not handled correctly:

1. Line comments

Input:

//1
//#define ID1 //2

This is the resulting parse tree:

01. compilationUnit
02.  statement:2
03.    comment:2
04.      line_comment
05.        COMMENT_START: "//"
06.        COMMENT_MODE_PART: "1"
07.      line_comment
08.        COMMENT_START: "//"
09.    defineDirective:8
10.      DEFINE: "#define"
11.      IDENTIFIER: "ID1"
12.      info_comment
13.        COMMENT_START: "//"
14.        COMMENT_MODE_PART: "2"
15.<EOF>

I want the "//" that currently forms the line_comment on lines 07-08 to become part of the defineDirective on line 09, as its COMMENT_START token.

2. Define directive with text

Other define rules are working correctly but:

#define USER_DESC_2 "abc " DEF "ABC2 \" M100 (100) 
#define USER_GCODE_2 "M140 S" STRINGIFY(PREHEAT_1_TEMP_BED) "\nM104 S" STRINGIFY(PREHEAT_1_TEMP_HOTEND)

These #define directives fail to parse and raise an exception.

I would appreciate any help with resolving these 2 problems I have at this moment or any recommendations on how my lexer/parser can be optimized.

Thanks in advance!

=================================Update===================================

First test case:

Input:

//1
//#define ID1 //2

Current result:

01. compilationUnit
02.  statement:2
03.    comment:2
04.      line_comment
05.        COMMENT_START: "//"
06.        COMMENT_MODE_PART: "1"
07.      line_comment
08.        COMMENT_START: "//"
09.    defineDirective:8
10.      DEFINE: "#define"
11.      IDENTIFIER: "ID1"
12.      info_comment
13.        COMMENT_START: "//"
14.        COMMENT_MODE_PART: "2"
15.<EOF>

Expected result:

01. compilationUnit
02.  statement:2
03.    comment:2
04.      line_comment
05.        COMMENT_START: "//"
06.        COMMENT_MODE_PART: "1"
07.    defineDirective:8
08.      COMMENT_START: "//"  
09.      DEFINE: "#define"
10.      IDENTIFIER: "ID1"
11.      info_comment
12.        COMMENT_START: "//"
13.        COMMENT_MODE_PART: "2"
14.<EOF>

Second test case:

Input:

#define USER_DESC_2 "Preheat for " PREHEAT_1_LABEL

Current result:

01.compilationUnit
02. statement:2
03.  defineDirective:5
04.   DEFINE: "#define"
05.   IDENTIFIER: "USER_DESC_2"
06.   STRING: "\"Preheat for \""
07.  IDENTIFIER: "PREHEAT_1_LABEL"
<EOF>

Expected result:

01.compilationUnit
02. statement:2
03.  defineDirective:5
04.   DEFINE: "#define"
05.   IDENTIFIER: "USER_DESC_2"
06.   STRING: "\"Preheat for \" PREHEAT_1_LABEL"
<EOF>

In the expected result, STRING holds the whole replacement text. I do not know whether it is better to enhance the STRING lexer token definition or to introduce a new parser rule to cover this case.
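On the first option (enhancing STRING): a rule of the shape '"' .*? '"' ends the token at the first double quote it meets, even one preceded by a backslash. The sketch below is not ANTLR. It uses java.util.regex as a stand-in to contrast that behaviour with an escape-aware pattern. The class StringTokenSketch and the helper firstMatch are hypothetical names chosen for this illustration. A lexer rule such as STRING : '"' ( '\\' . | ~["\\] )* '"' ; would be the usual ANTLR counterpart of the second pattern, but whether it suits this header format is an assumption to check against real input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StringTokenSketch {

    // Same shape as the lexer rule '"' .*? '"': stops at the first closing quote,
    // escaped or not. (Unlike ANTLR's '.', the regex '.' does not match newlines.)
    public static final Pattern NAIVE = Pattern.compile("\".*?\"");

    // Escape-aware variant: a backslash always consumes the character after it,
    // so \" inside the literal does not end the match.
    public static final Pattern ESCAPE_AWARE = Pattern.compile("\"(?:\\\\.|[^\"\\\\])*\"");

    // Returns the first substring matched by the pattern, or null if there is none.
    public static String firstMatch(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "\"say \\\"hi\\\"\" rest";   // the text: "say \"hi\"" rest
        System.out.println("naive        : " + firstMatch(NAIVE, input));
        System.out.println("escape-aware : " + firstMatch(ESCAPE_AWARE, input));
    }
}
```

With the escape-aware pattern, a literal that is never closed (such as the last one in the USER_DESC_2 line above) yields no string match at all, so the parser would then have to accept the leftover characters some other way.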

I'm working on a better solution. Can you confirm that the #define directive actually has the form optional_// #define IDENTIFIER replacement_value optional_line_comment? - BernardK

BernardK · Accepted Answer · 2021-01-07T14:33:31

Combining this post, your previous question and Bart's answer, and assuming that a define directive has the form

optional_// #define IDENTIFIER replacement_value optional_line_comment

and given the input file input.txt

/**
 * BLOCK COMMENT
 */
#pragma once
//#pragma once

/**
 * BLOCK COMMENT
 */
#define CONFIGURATION_H_VERSION 12345

#define IDENTIFIER abcd
#define IDENTIFIER_1 abcd
#define IDENTIFIER_1 abcd.dd

#define IDENTIFIER_2 true // Line
#define IDENTIFIER_20 {ONE, TWO} // Line
#define IDENTIFIER_20_30   { 1, 2, 3, 4 }
#define IDENTIFIER_20_30_A   [ 1, 2, 3, 4 ]
#define DEFAULT_A 10.0

//================================================================
//============================= INFO =============================
//================================================================

/**
 * SEPARATE BLOCK COMMENT
 */

// Line 1
// Line 2
//

//======================= this is a section ======================
// @section test

// Line 3
#define IDENTIFIER_TWO "(ONE, TWO, THREE)" // Line 4
//#define IDENTIFIER_3 Version.h // Line 5

// Line 6
#define IDENTIFIER_THREE

//1
//#define ID1 //2

#define USER_DESC_2 "Preheat for " PREHEAT_1_LABEL

#define USER_DESC_2 "abc " DEF "ABC2 \" M100 (100) 
#define USER_GCODE_2 "M140 S" STRINGIFY(PREHEAT_1_TEMP_BED) "\nM104 S" STRINGIFY(PREHEAT_1_TEMP_HOTEND)

If I have understood your two questions correctly, the grammar must produce one statement for each directive and one for each comment that is not followed by a directive. A directive can be preceded by a comment, which then becomes part of the statement. A directive can also be commented out, and it can be followed by an inline line comment (that is, a comment on the same line).

Grammar Header.g4 (without trace) :

grammar Header;

compilationUnit
    @init {System.out.println("Last update 1253");}
    :   ( statement {System.out.println("Statement found : `" + $statement.text + "`");}
        )* EOF
    ;

statement
    :   comment? pragma_directive
    |   comment? define_directive
    |   section
    |   comment
    ;

pragma_directive
     :   PRAGMA char_sequence
     ;

define_directive
    :   define_identifier replacement_comment[$define_identifier.statement_line]
    ;
    
define_identifier returns [int statement_line]
    :   LINE_COMMENT_DELIMITER? DEFINE {$statement_line = getCurrentToken().getLine();} IDENTIFIER
    ;

replacement_comment [int statement_line]
    :   anything+ line_comment?
    |   {getCurrentToken().getLine() == $statement_line}? line_comment
    |   {getCurrentToken().getLine() != $statement_line}?
    ;

section
    :   LINE_COMMENT_DELIMITER OTHER? SECTION char_sequence
    ;

comment
    :   BLOCK_COMMENT
    |   line_comment
    |   SEPARATOR ( IDENTIFIER | EQUALS )*
    ;

line_comment
    :   LINE_COMMENT_DELIMITER anything*
    ;

anything
    :   IDENTIFIER
    |   CHAR_SEQUENCE 
    |   STRING
    |   NUMBER
    |   OTHER
    ;

char_sequence
    :   CHAR_SEQUENCE
    |   IDENTIFIER
    ;
 
LINE_COMMENT_DELIMITER : '//' ;
PRAGMA        : '#pragma';
SECTION       : '@section';
DEFINE        : '#define';
STRING        : '"' .*? '"';
EQUALS        : '='+ ;
SEPARATOR     : LINE_COMMENT_DELIMITER EQUALS ;
IDENTIFIER    : [a-zA-Z_] [a-zA-Z_0-9]*;
CHAR_SEQUENCE : [a-zA-Z_] [a-zA-Z_0-9.]*;
NUMBER        : [0-9.]+ ;
BLOCK_COMMENT : '/**' .*? '*/';
WS            : [ \t]+ -> channel(HIDDEN) ;
NL            : (   '\r' '\n'?
                  | '\n'
                ) -> channel(HIDDEN) ;
OTHER         : . ;

Execution :

$ export CLASSPATH=".:/usr/local/lib/antlr-4.9-complete.jar"
$ alias a4='java -jar /usr/local/lib/antlr-4.9-complete.jar'
$ alias grun='java org.antlr.v4.gui.TestRig'
$ a4 Header.g4 
$ javac Header*.java
$ grun Header compilationUnit -tokens input.txt
[@0,0:23='/**\n * BLOCK COMMENT\n */',<BLOCK_COMMENT>,1:0]
[@1,24:24='\n',<NL>,channel=1,3:3]
[@2,25:31='#pragma',<'#pragma'>,4:0]
[@3,32:32=' ',<WS>,channel=1,4:7]
[@4,33:36='once',<IDENTIFIER>,4:8]
[@5,37:37='\n',<NL>,channel=1,4:12]
...
[@84,315:321='#define',<'#define'>,19:0]
[@85,322:322=' ',<WS>,channel=1,19:7]
[@86,323:340='IDENTIFIER_20_30_A',<IDENTIFIER>,19:8]
[@87,341:343='   ',<WS>,channel=1,19:26]
[@88,344:344='[',<OTHER>,19:29]
[@89,345:345=' ',<WS>,channel=1,19:30]
[@90,346:346='1',<NUMBER>,19:31]
[@91,347:347=',',<OTHER>,19:32]
...
[@139,644:668='//=======================',<SEPARATOR>,34:0]
[@140,669:669=' ',<WS>,channel=1,34:25]
[@141,670:673='this',<IDENTIFIER>,34:26]
...
[@257,1103:1102='<EOF>',<EOF>,51:0]
Last update 1253
Statement found : `/**
 * BLOCK COMMENT
 */
#pragma once`
Statement found : `//#pragma once`
...
Statement found : `#define DEFAULT_A 10.0`
...
Statement found : `// Line 2`
Statement found : `//`
...
Statement found : `//#define IDENTIFIER_3 Version.h // Line 5`
Statement found : `// Line 6
#define IDENTIFIER_THREE`
Statement found : `//1
//#define ID1 //2`
Statement found : `#define USER_DESC_2 "Preheat for " PREHEAT_1_LABEL`
Statement found : `#define USER_DESC_2 "abc " DEF "ABC2 \" M100 (100)`
Statement found : `#define USER_GCODE_2 "M140 S" STRINGIFY(PREHEAT_1_TEMP_BED) "\nM104 S" STRINGIFY(PREHEAT_1_TEMP_HOTEND)`

Grammar Header_trace.g4 (with trace) :

grammar Header_trace;

compilationUnit
    @init {System.out.println("Last update 1137");}
    :   statement[this.getRuleNames() /* parser rule names */]* EOF
    ;

statement [String[] rule_names]
    locals [String rule_name, int start_line, int end_line]
    @after { System.out.print("The next statement is a " + $rule_name);
             $start_line = $start.getLine();
             $end_line   = $stop.getLine();
             if ($start_line == $end_line)
                 System.out.print(" on line " + $start_line);
             else
                 System.out.print(" on lines " + $start_line + " to " + $end_line);
             System.out.println(" : ");
             System.out.println("`" + $text + "`");
           }
    :   comment? pragma_directive [rule_names] {$rule_name = $pragma_directive.rule_name;}
    |   comment? define_directive [rule_names] {$rule_name = $define_directive.rule_name;}
    |   section [rule_names]                   {$rule_name = $section.rule_name;}
    |   comment_only [rule_names]              {$rule_name = $comment_only.rule_name;}
     // comment_only can be replaced by comment when the trace is removed
    ;

pragma_directive [String[] rule_names] returns [String rule_name]
     :   PRAGMA char_sequence
            { $rule_name = rule_names[$ctx.getRuleIndex()]; }
     ;

define_directive [String[] rule_names] returns [String rule_name]
    locals [String dir_rule_name, int statement_line = 0]
    @init {$dir_rule_name = rule_names[_localctx.getRuleIndex()];}
    :   define_identifier replacement_comment[$dir_rule_name, $define_identifier.statement_line]
            { $rule_name = $replacement_comment.rule_name; }
    ;
    
define_identifier returns [int statement_line]
    :   LINE_COMMENT_DELIMITER? DEFINE {$statement_line = getCurrentToken().getLine();} IDENTIFIER
    ;

replacement_comment [String dir_rule_name, int statement_line] returns [String rule_name]
    :   any+=anything+ line_comment?
            { $rule_name = $dir_rule_name + " with replacement value";
              System.out.print("          anything matched : " );
              if ($any.size() > 0)
                  for (AnythingContext r : $any)
                      System.out.print(r.getText());
              else
                  System.out.print("(nothing)");

              System.out.println();
            }
    |   {getCurrentToken().getLine() == $statement_line}?
        line_comment
            { $rule_name = $dir_rule_name + " WITHOUT replacement value and with inline line comment"; }
    |   {getCurrentToken().getLine() != $statement_line}?
            { $rule_name = $dir_rule_name + " WITHOUT replacement value"; }
    ;

section [String[] rule_names] returns [String rule_name]
    :   LINE_COMMENT_DELIMITER OTHER? SECTION char_sequence
            { $rule_name = rule_names[$ctx.getRuleIndex()]; }
    ;

comment_only [String[] rule_names] returns [String rule_name]
    :   comment
            { $rule_name = rule_names[$ctx.getRuleIndex()]; }
    ;

comment
    :   BLOCK_COMMENT
    |   line_comment
    |   SEPARATOR ( IDENTIFIER | EQUALS )*
    ;

line_comment
    :   LINE_COMMENT_DELIMITER anything*
    ;

anything
    :   IDENTIFIER
    |   CHAR_SEQUENCE 
    |   STRING
    |   NUMBER
    |   OTHER
    ;

char_sequence
    :   CHAR_SEQUENCE
    |   IDENTIFIER
    ;
 
LINE_COMMENT_DELIMITER : '//' ;
PRAGMA        : '#pragma';
SECTION       : '@section';
DEFINE        : '#define';
STRING        : '"' .*? '"';
EQUALS        : '='+ ;
SEPARATOR     : LINE_COMMENT_DELIMITER EQUALS ;
IDENTIFIER    : [a-zA-Z_] [a-zA-Z_0-9]*;
CHAR_SEQUENCE : [a-zA-Z_] [a-zA-Z_0-9.]*;
NUMBER        : [0-9.]+ ;
BLOCK_COMMENT : '/**' .*? '*/';
WS            : [ \t]+ -> channel(HIDDEN) ;
NL            : (   '\r' '\n'?
                  | '\n'
                ) -> channel(HIDDEN) ;
OTHER         : .;

Execution :

$ a4 Header_trace.g4 
$ javac Header*.java
$ grun Header_trace compilationUnit -tokens input.txt
[@0,0:23='/**\n * BLOCK COMMENT\n */',<BLOCK_COMMENT>,1:0]
[@1,24:24='\n',<NL>,channel=1,3:3]
[@2,25:31='#pragma',<'#pragma'>,4:0]
[@3,32:32=' ',<WS>,channel=1,4:7]
[@4,33:36='once',<IDENTIFIER>,4:8]
[@5,37:37='\n',<NL>,channel=1,4:12]
...
[@257,1103:1102='<EOF>',<EOF>,51:0]
Last update 1137
The next statement is a pragma_directive on lines 1 to 4 : 
`/**
 * BLOCK COMMENT
 */
#pragma once`
...
          anything matched : 10.0
The next statement is a define_directive with replacement value on line 20 : 
`#define DEFAULT_A 10.0`
The next statement is a comment_only on line 22 : 
`//================================================================`
...
The next statement is a comment_only on line 31 : 
`// Line 2`
The next statement is a comment_only on line 32 : 
`//`
...
          anything matched : Version.h
The next statement is a define_directive with replacement value on line 39 : 
`//#define IDENTIFIER_3 Version.h // Line 5`
The next statement is a define_directive WITHOUT replacement value on lines 41 to 42 : 
`// Line 6
#define IDENTIFIER_THREE`
The next statement is a define_directive WITHOUT replacement value and with inline line comment on lines 44 to 45 : 
`//1
//#define ID1 //2`
          anything matched : "Preheat for "PREHEAT_1_LABEL
The next statement is a define_directive with replacement value on line 47 : 
`#define USER_DESC_2 "Preheat for " PREHEAT_1_LABEL`
...

It turned out that switching to mode COMMENT_MODE on a line comment delimiter was no longer necessary: the optional LINE_COMMENT_DELIMITER? at the beginning of the define directive rule (as you did with COMMENT_START?) covers the commented-out directive, and no token needs special treatment after //.

My first approach, shown here, ran into one difficulty:

define_directive
    locals [int statement_line]
    :   LINE_COMMENT_DELIMITER? DEFINE IDENTIFIER anything+ line_comment?
    |   LINE_COMMENT_DELIMITER? DEFINE {$statement_line = getCurrentToken().getLine();}
        IDENTIFIER same_line_line_comment[$statement_line]
    |   LINE_COMMENT_DELIMITER? DEFINE IDENTIFIER
    ;

same_line_line_comment [int statement_line]
    :   {getCurrentToken().getLine() == $statement_line}?
        line_comment
    ;

The following lines

// Line 6
#define IDENTIFIER_THREE

//1

were parsed with the second alternative instead of the third :

compare statement line 42 with comment line 44
line 44:0 rule same_line_line_comment failed predicate: {getCurrentToken().getLine() == $statement_line}?
The next statement is a define_directive WITHOUT replacement value and with inline line comment on lines 41 to 42 : 
`// Line 6
#define IDENTIFIER_THREE`

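Although the predicate guarding the subrule same_line_line_comment evaluated to false, it had no effect on the choice of alternative: the parser still entered the second alternative and only then threw a FailedPredicateException, and the trace message was wrong. It may have to do with the rules described under Finding Visible Predicates in the ANTLR documentation: a predicate that comes after a token reference or an action is not evaluated while the parser is predicting which alternative to take.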

The solution was to split the processing of the #define directive into a fixed part define_identifier rule and a variable part replacement_comment rule with the semantic predicate (which, to be effective in the parsing decision, must be placed at the beginning of the alternative).
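To make the three-way decision in replacement_comment concrete, here is a plain Java sketch of the same logic outside ANTLR. It is only an illustration under assumptions: the class ReplacementCommentSketch, the method classify and its parameters are hypothetical names, the next visible token is passed in by hand instead of coming from a token stream, and every token other than // is treated as a replacement value (the grammar only accepts the token types listed in the rule anything).

```java
public class ReplacementCommentSketch {

    /**
     * Hypothetical helper mirroring the three alternatives of replacement_comment.
     *
     * defineLine : line of the #define directive
     * nextText   : text of the first visible token after the macro name,
     *              or null when the input ends there (assumption of this sketch)
     * nextLine   : line of that token (ignored when nextText is null)
     */
    public static String classify(int defineLine, String nextText, int nextLine) {
        if (nextText == null) {
            return "WITHOUT replacement value";
        }
        if (!nextText.equals("//")) {
            // First alternative: anything+ line_comment?
            // Simplification: any non-comment token counts as a replacement value.
            return "with replacement value";
        }
        if (nextLine == defineLine) {
            // Second alternative: the "same line" predicate holds,
            // so the comment is an inline comment of this directive.
            return "WITHOUT replacement value and with inline line comment";
        }
        // Third alternative: the comment starts on a later line
        // and belongs to the next statement.
        return "WITHOUT replacement value";
    }

    public static void main(String[] args) {
        // "#define IDENTIFIER_THREE" on line 42, next comment "//1" on line 44
        System.out.println(classify(42, "//", 44));
        // "//#define ID1 //2" on line 45, inline comment on the same line
        System.out.println(classify(45, "//", 45));
        // "#define USER_DESC_2 "Preheat for " PREHEAT_1_LABEL" on line 47
        System.out.println(classify(47, "\"Preheat for \"", 47));
    }
}
```

The point of the sketch is that the line comparison happens before anything is consumed, which is what placing the predicate at the very start of an alternative achieves in the grammar.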
