3 votes

I have a lexer that creates MACRO tokens for a dynamic list of macro strings passed to the lexer. I used a semantic predicate in the topmost lexer rule to implement this feature:

MACRO: { macros != null && tryMacro() }? .;

Where tryMacro() just checks if any macro string matches the input sequence.
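
The question doesn't show tryMacro() itself; a minimal sketch of what such a helper could look like as a lexer member follows (the macros field and the matching logic are assumptions, not the asker's code; _input.LA() is the standard ANTLR4 CharStream lookahead):

    @lexer::members {
        public java.util.List<String> macros; // injected by the application (assumption)

        // Hypothetical reconstruction: returns true if any macro string
        // matches the characters ahead of the current lexer position.
        // _input.LA(1) is the next character to be consumed.
        public boolean tryMacro() {
            for (String macro : macros) {
                boolean matched = true;
                for (int i = 0; i < macro.length(); i++) {
                    if (_input.LA(i + 1) != macro.charAt(i)) {
                        matched = false;
                        break;
                    }
                }
                if (matched) return true;
            }
            return false;
        }
    }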

The performance of this approach was very bad, and after some research I tried changing the lexer rule to the following:

MACRO: . { macros != null && tryMacro() }?;

This improved the performance dramatically, but I don't really understand why. :) Since '.' matches any character, the semantic predicate should be invoked exactly as many times as before, shouldn't it? Can someone provide an explanation for this behavior?

This problem is well known. I described it in my ANTLR4 C++ target blog post: soft-gems.net/index.php/tools/49-the-antlr4-c-target-is-here. Predicates on the left-hand side of a rule prevent the DFA generated for it from being cached and cause the predicates to be evaluated again and again. This is especially bad if you have many alts, e.g. many keywords (lexer rules are kind of alts of a hidden root context). - Mike Lischke

@MikeLischke your link is broken :( - Ivan Kochurkin

I moved the site. The link has changed to: soft-gems.net/the-antlr4-c-target-is-here - Mike Lischke
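
To make the comment concrete, the two placements below differ exactly in whether matching can run through the cached DFA before any predicate fires (a sketch with a made-up inKeywordMode() predicate, not code from the blog post):

    // Slow: the predicate sits before the match, so it must be evaluated
    // during every prediction and the decision cannot be fully cached.
    KW_IF : { inKeywordMode() }? 'if' ;

    // Fast: the cached DFA matches 'if' first; the predicate is evaluated
    // only when this rule is actually a candidate.
    KW_IF : 'if' { inKeywordMode() }? ;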

2 Answers

2 votes

The reason is pretty simple: if you put the predicate at the start, the lexer evaluates it to decide whether the MACRO rule should apply at all. If you put it at the end, the check is only performed once the lexer has a potential match for the MACRO rule.

Since MACRO is very generic, I suppose you put it at the end of your rules, and due to the priority rules it will surely be tried last. Since it can match only single-character tokens, more specific rules will take priority over it.

If the MACRO rule is superseded by a higher-priority rule, it won't be considered and your predicate won't be invoked.
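
To illustrate the ordering argument, consider a grammar laid out like the sketch below (rule names invented for the example). For any letter, ID wins over MACRO because it matches a longer sequence, or, on a single-letter tie, because it is declared first; the predicate at the end of MACRO therefore only runs when the single-character match is actually the best candidate:

    ID    : [a-zA-Z_]+ ;    // longer match, declared earlier: takes priority
    INT   : [0-9]+ ;
    WS    : [ \t\r\n]+ -> skip ;
    MACRO : . { macros != null && tryMacro() }? ;    // tried last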

1 vote

I debugged this a bit further, and it turned out that reordering the rule changed the behavior of the lexer, causing macros to no longer be accepted during parsing. The perceived performance increase came from the fact that the semantic predicate was only evaluated a couple of times before the lexer dropped the rule while doing its predictions. So the changed rule was actually invalid, not a performance improvement.

I finally solved the performance issue by moving the macro handling to the parser.
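
The answer doesn't show the final design; one way to move the check into the parser (a sketch under that assumption, with an invented isMacro() helper; _input.LT(1) is the standard ANTLR4 parser token lookahead) is to lex macro names as ordinary identifiers and gate a parser rule with the predicate:

    @parser::members {
        private java.util.Set<String> macros; // injected, as in the lexer before
        boolean isMacro(String name) { return macros != null && macros.contains(name); }
    }

    macroRef
        : { isMacro(_input.LT(1).getText()) }? IDENTIFIER
        ;

This keeps the lexer predicate-free, so its DFA can be cached, and the macro table is consulted only at the few places in the grammar where a macro reference can actually occur.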