I want to parse C/C++ preprocessing directives while skipping over all other C++ syntax. In particular, I need to distinguish between function-like and object-like macros:

# define abc(x,y) x##y  // function-like macro using the token-pasting operator
# define abc (a,b,c)   // object-like macro: replaces 'abc' with '(a,b,c)'

The key difference is that a function-like macro has no hidden tokens (whitespace or a multi-line comment) between the identifier abc and the left parenthesis that follows it.
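To make the rule concrete outside of ANTLR, here is a minimal Python sketch that classifies a single raw #define line by looking at the character immediately after the macro name. The helper name `classify_define` is hypothetical, and the sketch assumes that line continuations have already been spliced and that no comment appears before the macro name.

```python
import re

# Hypothetical helper, not part of any grammar above.
# Captures the macro name and the single character that directly follows it.
_DEFINE = re.compile(r'^[ \t]*#[ \t]*define[ \t]+([A-Za-z_][A-Za-z0-9_]*)(.?)')

def classify_define(line):
    """Return (name, kind) for a #define line, or None for any other line."""
    m = _DEFINE.match(line)
    if m is None:
        return None
    name, next_char = m.group(1), m.group(2)
    # Function-like only if '(' touches the name: no whitespace, no comment.
    kind = "function-like" if next_char == "(" else "object-like"
    return name, kind
```

A comment between the name and the parenthesis, as in `#define abc/**/(a)`, makes the next character `/`, so the line is classified as object-like, matching the rule stated above.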

The problem is that my lexer already sends all multi-line comments and whitespace to hidden channels. So how can the parser detect whitespace before the left parenthesis?

The lexer grammar I tried looks like this:

CRLF: '\r'? '\n' -> channel(LINEBREAK);

WS: [ \t\f]+ -> channel(WHITESPACE);

ML_COMMENT: '/*'  .*? '*/' -> channel(COMMENTS);
SL_COMMENT: '//' ~[\r\n]*  -> channel(COMMENTS); 

PPHASH: {getCharPositionInLine() == 0}? (ML_COMMENT | [ \t\f])* '#'
; //any line starts with a # as the first char (comments, ws before it skipped)

CHARACTER_LITERAL : 'L'? '\'' (CH_ESC |~['\r\n\\])*? '\'' ;  //not exactly 1 char between '' e.g. '0x05'
fragment CH_ESC :  '\\' . ;
STRING_LITERAL: 'L'? '"' (STR_ESC | ~["\r\n\\])*? '"'  ;
fragment STR_ESC:  '\\'  .  ;

ANY_ID: [_0-9a-zA-Z]+ ;

ALL_SYMBOL:
  '~' | '!' | '@' | '#' | '$' | '%' | '^' | '&' | '*' | '=' | '-' | '+' | '\\'| '|' | ':' | ';' | '"' | '\''|
  '<' | '>' | '.' | '?' | '/' | ',' | '[' | ']' | '(' | ')' | '{' | '}'

; //basically everything found in a keyboard

I intend to signal the start of a preprocessing directive to the parser with a PPHASH token, which is a '#' at the start of a line.

My incorrect parser grammar for a #define line:

define_line:
  PPHASH 'define'  (function_like_define | object_like_define)
;
//--- function like define ---
function_like_define:
  ANY_ID '(' parameter_seq? ')'  fl_replacement_string
;
parameter_seq:  ANY_ID ( ',' ANY_ID)* ;

//--- object like define ---
object_like_define:
  ANY_ID ol_replacement_string
;
// fl and ol use different rule names so a visitor need not inspect the parent; a separate rule makes the replacement a single node that supports getText()
fl_replacement_string: any_non_crlf_token*;
ol_replacement_string: any_non_crlf_token*;
any_non_crlf_token:ANY_ID | .....;

This grammar incorrectly treats #define abc (a,b,c) as a function-like macro. How can I fix the grammar?

1 Answer

Parsing preprocessor tokens with all the bells and whistles (skipping anything disabled, line splicing, macro handling, stringizing, charizing and all that) is not a task for a general parser. You should instead implement your own input stream that handles the preprocessor macros (which involves an expression parser for the conditions). This stream evaluates what is enabled at the time the input is read and skips over disabled regions until the matching #else or #endif is found. Conditional directives such as #if / #ifdef are line based, so you can easily do that by reading the input line by line.
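As a rough illustration of the line-by-line idea, the following Python sketch filters out disabled regions before anything reaches the parser. It is only a sketch under stated assumptions: the function name `enabled_lines` is hypothetical, lines are assumed to be already spliced, and only #ifdef, #ifndef, #else, #endif, #define and #undef are handled. #if and #elif are rejected, because they need the expression evaluator mentioned above.

```python
import re

# Captures the directive keyword and the first identifier after it (if any).
_DIRECTIVE = re.compile(r'^[ \t]*#[ \t]*(\w+)[ \t]*(\w*)')

def enabled_lines(lines, defined):
    """Yield only the lines that sit in currently enabled regions."""
    defined = set(defined)
    stack = []  # one bool per open #ifdef/#ifndef: is its current branch taken?
    for line in lines:
        m = _DIRECTIVE.match(line)
        keyword, arg = (m.group(1), m.group(2)) if m else (None, "")
        if keyword in ("if", "elif"):
            raise NotImplementedError("#if/#elif need an expression evaluator")
        elif keyword == "ifdef":
            stack.append(arg in defined)
        elif keyword == "ifndef":
            stack.append(arg not in defined)
        elif keyword == "else" and stack:
            stack[-1] = not stack[-1]      # switch to the other branch
        elif keyword == "endif" and stack:
            stack.pop()
        elif all(stack):                   # enabled only if every level is taken
            if keyword == "define":
                defined.add(arg)
            elif keyword == "undef":
                defined.discard(arg)
            yield line
```

The surviving lines, including the #define lines themselves, can then be handed to a grammar that no longer has to care about conditional compilation.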

I went through this process several years ago and offer the result as a free download on my homepage: http://www.soft-gems.net/index.php/java/windows-resource-file-parser-and-converter. The project is a parser for Windows .rc files but, because of the nature of that format, it completely implements a .h file parser, an expression evaluator, macro expansion, etc.