0
votes

I'm building a parser for a language containing preprocessor instructions in special preprocessor sections (enclosed by { and }). One of them is similar to the C #define.

I'd like to lex the file in one pass using an island grammar for the preprocessor parts. When I hit the #define instruction, I'd like to switch to another island grammar that contains all the tokens (approx. 200) of the "regular" part except the preprocessor section start token, emits those tokens on a different channel, and of course has a stop token that returns to the preprocessor island grammar. It is not vital that the preprocessor section start token { is actually removed, since the files I parse are valid, but it would be nice.

Is there a way to "reuse" the lexer rules for two modes? (I can emit to a named non-const channel whose value I could change upon entering/leaving the island.)

Here's a sample source file:

int a = 42;

{ // start preprocessor section

// simple single line #define
#define ABC 42

// will be fixed at "2 * 42" even if ABC is changed later on
#define DEF 2 * ABC

// multi-line define (all but the last line need a "\" before the newline)
#define GHI   3 \
              + 4

// the definition can contain (almost) arbitrary code, except line comments, preprocessor sections and preprocessor statements
#define JKL  if (a > 23) then b = c + d; str = "} <- this must not be the end of the preprocessor section"; end_if;              

} // end preprocessor section

2 Answers

2
votes

You cannot currently reuse/import a lexer rule defined in one mode into another. I typically do something like the following.

LBRACE : '{';

mode OtherMode;

  OtherMode_LBRACE : LBRACE -> type(LBRACE);

Due to a code generation optimization in the ANTLR 4 Tool, constructs like the above will not actually create a separate OtherMode_LBRACE token due to the use of the type lexer command in that rule.

0
votes

I finally took the approach of simply duplicating the rules, as Sam proposed. I first tried switching back to the main mode from the other mode, using embedded actions to check at certain positions whether to return, plus actions on every rule to set the "current active channel". This cluttered up all the rules and did not feel like the right thing to do.

To reduce the manual work, I reorganized my lexer file so that I can now quite easily copy the rules to be included in the other mode, transform them with this regular expression, and insert them into the other mode's section:

search for:

^\s*(([a-z_0-9]+)[ \t\n]*)(:.*?;)\n

replace with (PPP_ is just a prefix I chose):

 PPP_$1 : $2 -> type\($2\), channel\(2\); \n\t\t\t/*$3*/\n

and use regular expression options "ignore case", "single line", "multiline"

If you don't want to emit the tokens to another channel, remove the channel(2) part. The editor I used (Notepad++) required me to escape the parentheses in the replacement; yours may not.
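
If you would rather script the transformation than run it in an editor, the same search and replace can be sketched in Python. This is only a rough equivalent, not what I actually used: the function name duplicate_rules and its prefix/channel parameters are made up for illustration, and it assumes the copied rules are simple "NAME : body;" rules, each ending with a newline.

```python
import re

# Same pattern as the search expression above. The flags mirror the editor
# options "ignore case", "single line" (DOTALL) and "multiline".
RULE = re.compile(
    r"^\s*(([a-z_0-9]+)[ \t\n]*)(:.*?;)\n",
    re.IGNORECASE | re.DOTALL | re.MULTILINE,
)


def duplicate_rules(rules_text, prefix="PPP_", channel=2):
    """Return mode-local copies of the lexer rules found in rules_text.

    Hypothetical helper: each copy refers to the original rule, reuses its
    token type, and keeps the original body as a comment for reference.
    """
    def rewrite(match):
        name_with_space = match.group(1)  # $1: rule name plus trailing whitespace
        name = match.group(2)             # $2: rule name only
        body = match.group(3)             # $3: ": ... ;" of the original rule
        return (
            f" {prefix}{name_with_space} : {name} -> type({name}), "
            f"channel({channel}); \n\t\t\t/*{body}*/\n"
        )

    return RULE.sub(rewrite, rules_text)


if __name__ == "__main__":
    sample = "LBRACE : '{';\nSEMI : ';';\n"
    print(duplicate_rules(sample))
```

Because the replacement is built in a function rather than a template string, nothing needs to be escaped. The generated rules still have to be pasted into the other mode's section by hand.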