I'm writing a parser for a C++-header-style file and am having trouble handling line comments correctly.
CustomLexer.g4:
lexer grammar CustomLexer;
SPACES : [ \r\n\t]+ -> skip;
COMMENT_START : '//' -> pushMode(COMMENT_MODE);
PRAGMA : '#pragma';
SECTION : '@section';
DEFINE : '#define';
UNDEF : '#undef';
IF : '#if';
ELIF : '#elif';
ELSE : '#else';
IFDEF : '#ifdef';
IFNDEF : '#ifndef';
ENDIF : '#endif';
ENABLED : 'ENABLED';
DISABLED : 'DISABLED';
EITHER : 'EITHER';
ANY : 'ANY';
DEFINED : 'defined';
BOTH : 'BOTH';
BOOLEAN_LITERAL : 'true' | 'false';
STRING : '"' .*? '"';
HEXADECIMAL : '0x' ([a-fA-F0-9])+;
LITERAL_SUFFIX : 'L'|'u'|'U'|'Lu'|'LU'|'uL'|'UL'|'f'|'F';
IDENTIFIER : [a-zA-Z_] [a-zA-Z_0-9]*;
BLOCK_COMMENT : '/**' .*? '*/';
NUMBER : ('-')? Int ('.' Digit*)? | '0';
CHAR_SEQUENCE : [a-zA-Z_] [a-zA-Z_0-9.]*;
ARRAY_SEQUENCE : '{' .*? '}';
OPAREN : '(';
CPAREN : ')';
OBRACE : '{';
CBRACE : '}';
ADD : '+';
SUBTRACT : '-';
MULTIPLY : '*';
DIVIDE : '/';
MODULUS : '%';
OR : '||';
AND : '&&';
EQUALS : '==';
NEQUALS : '!=';
GTEQUALS : '>=';
LTEQUALS : '<=';
GT : '>';
LT : '<';
EXCL : '!';
QMARK : '?';
COLON : ':';
COMA : ',';
OTHER : .;
fragment Int : [0-9] Digit* | '0';
fragment Digit : [0-9];
mode COMMENT_MODE;
COMMENT_MODE_DEFINE : '#define' -> type(DEFINE), popMode;
COMMENT_MODE_SECTION : '@section' -> type(SECTION), popMode;
COMMENT_MODE_IF : '#if' -> type(IF), popMode;
COMMENT_MODE_ENDIF : '#endif' -> type(ENDIF), popMode;
COMMENT_MODE_LINE_BREAK : [\r\n]+ -> skip, popMode;
COMMENT_MODE_PART : ~[\r\n];
CustomParser.g4:
parser grammar CustomParser;
options { tokenVocab=CustomLexer; }
compilationUnit
: statement* EOF
;
statement
: comment? pragmaDirective
| comment? defineDirective
| comment? undefDirective
| comment? ifDirective
| comment? ifdefDirective
| comment? ifndefDirective
| sectionLineComment
| comment
;
pragmaDirective
: PRAGMA char_sequence
;
subDirectives
: ifDirective+
| ifdefDirective+
| ifndefDirective+
| defineDirective+
| undefDirective+
| comment+
;
ifdefDirective
: IFDEF IDENTIFIER subDirectives+ ENDIF
;
ifndefDirective
: IFNDEF IDENTIFIER subDirectives+ ENDIF
;
ifDirective
: ifStatement elseIfStatement* elseStatement? ENDIF
;
ifStatement
: IF expression (subDirectives)*
;
elseIfStatement
: ELIF expression (subDirectives)*
;
elseStatement
: ELSE (subDirectives)*
;
defineDirective
: BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER BOOLEAN_LITERAL info_comment?
| BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER (char_sequence COMA?)+ info_comment?
| BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER OPAREN? NUMBER LITERAL_SUFFIX? CPAREN? info_comment?
| BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER HEXADECIMAL info_comment?
| BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER STRING info_comment?
| BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER OBRACE? (ARRAY_SEQUENCE COMA?)+ CBRACE? info_comment?
| BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER expression info_comment?
| BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER info_comment?
;
undefDirective
: BLOCK_COMMENT? COMMENT_START? UNDEF IDENTIFIER info_comment?;
sectionLineComment
: COMMENT_START COMMENT_MODE_PART? SECTION char_sequence
;
comment
: BLOCK_COMMENT
| line_comment+
;
expression
: simpleExpression
| customExpression
| enabledExpression
| disabledExpression
| bothExpression
| eitherExpression
| anyExpression
| definedExpression
| comparisonExpression
| arithmeticExpression
;
arithmeticExpression
: arithmeticExpression (MULTIPLY | DIVIDE) arithmeticExpression
| arithmeticExpression (ADD | SUBTRACT) arithmeticExpression
| OPAREN arithmeticExpression CPAREN
| expressionIdentifier
;
comparisonExpression
: comparisonExpression (EQUALS | NEQUALS | GTEQUALS | LTEQUALS | GT | LT) comparisonExpression
| comparisonExpression (AND | OR) comparisonExpression
| EXCL? OPAREN comparisonExpression CPAREN
| eitherExpression
| enabledExpression
| bothExpression
| anyExpression
| definedExpression
| disabledExpression
| customExpression
| simpleExpression
| expressionIdentifier
;
enabledExpression : EXCL? OPAREN? ENABLED OPAREN IDENTIFIER CPAREN CPAREN?;
disabledExpression : EXCL? OPAREN? DISABLED OPAREN IDENTIFIER CPAREN CPAREN?;
bothExpression : EXCL? OPAREN? BOTH OPAREN identifiers identifiers CPAREN CPAREN?;
eitherExpression : EXCL? OPAREN? EITHER OPAREN identifiers+ CPAREN CPAREN?;
anyExpression : EXCL? OPAREN? ANY OPAREN identifiers+ CPAREN CPAREN?;
definedExpression : EXCL? OPAREN? DEFINED OPAREN IDENTIFIER CPAREN CPAREN?;
customExpression : EXCL? IDENTIFIER OPAREN IDENTIFIER CPAREN;
simpleExpression : EXCL? IDENTIFIER;
expressionIdentifier : IDENTIFIER | NUMBER;
identifiers
: IDENTIFIER COMA?
;
line_comment
: COMMENT_START COMMENT_MODE_PART*
;
info_comment
: COMMENT_START COMMENT_MODE_PART*
;
char_sequence
: CHAR_SEQUENCE
| IDENTIFIER
;
It works fine for about 95% of the directives and comments in my header file, but a few scenarios are still not handled correctly:
1. Line comments
Input:
//1
//#define ID1 //2
This is the list of tokens:
01. compilationUnit
02. statement:2
03. comment:2
04. line_comment
05. COMMENT_START: "//"
06. COMMENT_MODE_PART: "1"
07. line_comment
08. COMMENT_START: "//"
09. defineDirective:8
10. DEFINE: "#define"
11. IDENTIFIER: "ID1"
12. info_comment
13. COMMENT_START: "//"
14. COMMENT_MODE_PART: "2"
15.<EOF>
I want the COMMENT_START token on line 08 to belong to the defineDirective on line 09 rather than to the line_comment on line 07.
2. Define directive with text
The other define rules work correctly, except for directives like these:
#define USER_DESC_2 "abc " DEF "ABC2 \" M100 (100)
#define USER_GCODE_2 "M140 S" STRINGIFY(PREHEAT_1_TEMP_BED) "\nM104 S" STRINGIFY(PREHEAT_1_TEMP_HOTEND)
These "define" directives fail to parse and raise an exception.
I would appreciate any help resolving these two problems, as well as any recommendations on how my lexer/parser could be improved.
Thanks in advance!
=================================Update===================================
First test case:
Input:
//1
//#define ID1 //2
Current result:
01. compilationUnit
02. statement:2
03. comment:2
04. line_comment
05. COMMENT_START: "//"
06. COMMENT_MODE_PART: "1"
07. line_comment
08. COMMENT_START: "//"
09. defineDirective:8
10. DEFINE: "#define"
11. IDENTIFIER: "ID1"
12. info_comment
13. COMMENT_START: "//"
14. COMMENT_MODE_PART: "2"
15.<EOF>
Expected result:
01. compilationUnit
02. statement:2
03. comment:2
04. line_comment
05. COMMENT_START: "//"
06. COMMENT_MODE_PART: "1"
07. defineDirective:8
08. COMMENT_START: "//"
09. DEFINE: "#define"
10. IDENTIFIER: "ID1"
11. info_comment
12. COMMENT_START: "//"
13. COMMENT_MODE_PART: "2"
14.<EOF>
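The current result probably arises because, at the second "//", both continuing the line_comment+ loop and leaving it for defineDirective are viable, and ANTLR resolves that kind of ambiguity in favour of staying in the loop. A minimal, untested sketch of one way to remove the ambiguity is to require at least one COMMENT_MODE_PART in line_comment. It assumes that an empty "//" line never has to parse as a comment on its own; that assumption is not stated anywhere above.

```antlr
// Sketch only: replaces the existing line_comment rule in CustomParser.g4.
// With "+" instead of "*", a bare COMMENT_START followed directly by DEFINE
// can no longer be a line_comment, so it is left for the optional
// COMMENT_START at the start of defineDirective.
line_comment
    : COMMENT_START COMMENT_MODE_PART+
    ;
```

Note that "// #define ID1" (with a space after the slashes) would still produce a COMMENT_MODE_PART for the space, so that form would continue to be read as a line_comment followed by a separate defineDirective.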
Second test case:
Input:
#define USER_DESC_2 "Preheat for " PREHEAT_1_LABEL
Current result:
01.compilationUnit
02. statement:2
03. defineDirective:5
04. DEFINE: "#define"
05. IDENTIFIER: "USER_DESC_2"
06. STRING: "\"Preheat for \""
07. IDENTIFIER: "PREHEAT_1_LABEL"
<EOF>
Expected result:
01.compilationUnit
02. statement:2
03. defineDirective:5
04. DEFINE: "#define"
05. IDENTIFIER: "USER_DESC_2"
06. STRING: "\"Preheat for \" PREHEAT_1_LABEL"
<EOF>
In the expected result, STRING holds the whole replacement text. I don't know whether it is better to extend the STRING lexer token definition or to introduce a new parser rule to cover this case.
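For illustration, here is a rough sketch of the parser-rule route. The rule names concatenatedText and textPart are hypothetical, and the result is a subtree of several tokens rather than the single STRING token shown in the expected result above. It is meant to cover the second test case and calls such as STRINGIFY(...), not every possible replacement text.

```antlr
// Hypothetical extra alternative for defineDirective in CustomParser.g4:
//   | BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER concatenatedText info_comment?

// A string literal followed by at least one more piece, so that a lone
// STRING is still handled by the existing single-STRING alternative.
concatenatedText
    : STRING textPart+
    ;

// A piece is another literal, a macro name, or a one-argument macro call
// such as STRINGIFY(PREHEAT_1_TEMP_BED).
textPart
    : STRING
    | IDENTIFIER (OPAREN IDENTIFIER CPAREN)?
    ;
```

Because whitespace is skipped by the lexer, the text of such a subtree would not contain the original spaces between the pieces.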
#define directive is actually optional_// #define IDENTIFIER replacement_value optional_line_comment ? – BernardK