I want to parse C/C++ preprocessor directives while skipping over all other syntax. In particular, I need to differentiate between function-like and object-like macros:
# define abc(x,y) x##y //function like macro & token pasting operator
# define abc (a,b,c) //object like macro, replace 'abc' by '(a,b,c)'
The key difference is that a function-like macro has no hidden tokens (whitespace or a multi-line comment) between the identifier abc and the left parenthesis that follows it.
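To make the rule concrete, here is a minimal Python sketch of the distinction, written outside ANTLR and not meant as a fix for the grammar. The names `classify_define` and `_DEFINE_HEAD` are hypothetical, and the sketch assumes a single-line directive with only spaces or tabs (no comments) between '#', 'define' and the macro name:

```python
import re

# Hypothetical helper pattern: matches "# define NAME" and captures NAME.
# Simplifying assumption: only spaces/tabs appear before the macro name.
_DEFINE_HEAD = re.compile(r'[ \t]*#[ \t]*define[ \t]+([A-Za-z_][A-Za-z_0-9]*)')

def classify_define(line):
    """Return 'function-like', 'object-like', or None if not a #define line."""
    m = _DEFINE_HEAD.match(line)
    if m is None:
        return None
    # The macro is function-like only if '(' is the very next character
    # after the name; whitespace or a comment in between makes it object-like.
    rest = line[m.end():]
    return "function-like" if rest.startswith("(") else "object-like"

print(classify_define("# define abc(x,y) x##y"))
print(classify_define("# define abc (a,b,c)"))
```

The check looks at the raw character right after the identifier, which is exactly the information that is lost once whitespace and comments go to hidden channels.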
The problem is that my lexer already sends all multi-line comments and whitespace to hidden channels, so how can the parser tell whether there was whitespace before the left parenthesis?
The lexer grammar I tried looks like this:
CRLF: '\r'? '\n' -> channel(LINEBREAK);
WS: [ \t\f]+ -> channel(WHITESPACE);
ML_COMMENT: '/*' .*? '*/' -> channel(COMMENTS);
SL_COMMENT: '//' ~[\r\n]* -> channel(COMMENTS);
PPHASH: {getCharPositionInLine() == 0}? (ML_COMMENT | [ \t\f])* '#'
; //any line starts with a # as the first char (comments, ws before it skipped)
CHARACTER_LITERAL : 'L'? '\'' (CH_ESC |~['\r\n\\])*? '\'' ; //not exactly 1 char between '' e.g. '0x05'
fragment CH_ESC : '\\' . ;
STRING_LITERAL: 'L'? '"' (STR_ESC | ~["\r\n\\])*? '"' ;
fragment STR_ESC: '\\' . ;
ANY_ID: [_0-9a-zA-Z]+ ;
ALL_SYMBOL:
'~' | '!' | '@' | '#' | '$' | '%' | '^' | '&' | '*' | '=' | '-' | '+' | '\\'| '|' | ':' | ';' | '"' | '\''|
'<' | '>' | '.' | '?' | '/' | ',' | '[' | ']' | '(' | ')' | '{' | '}'
; //basically everything found in a keyboard
I intend to signal the start of a preprocessing directive to the parser with a PPHASH token, which is a '#' at the start of a line (optionally preceded by whitespace or comments).
My incorrect parser grammar for a #define line:
define_line:
PPHASH 'define' (function_like_define | object_like_define)
;
//--- function like define ---
function_like_define:
ANY_ID '(' parameter_seq? ')' fl_replacement_string
;
parameter_seq: ANY_ID ( ',' ANY_ID)* ;
//--- object like define ---
object_like_define:
ANY_ID ol_replacement_string
;
//fl and ol use different rule names so the visitor does not need to test the parent; a separate rule makes the replacement a single node that supports getText()
fl_replacement_string: any_non_crlf_token*;
ol_replacement_string: any_non_crlf_token*;
any_non_crlf_token:ANY_ID | .....;
This grammar incorrectly treats #define abc (a,b,c) as a function-like macro. How can I fix the grammar?