This problem can be approached using lexical modes by having one mode for the inside of strings and one (or more) for the outside. Seeing a "
on the outside would switch to the inside mode and seeing a \(
or "
would switch back outside. The only complicated part would be seeing a )
on the outside: Sometimes it should switch back to the inside (because it corresponds to a \(
) and some times it shouldn't (when it corresponds to a plain (
).
The most basic way to achieve this would be like this:
Lexer:
lexer grammar StringLexer;
IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]* ;
DQUOTE: '"' -> pushMode(IN_STRING);
LPAR: '(' -> pushMode(DEFAULT_MODE);
RPAR: ')' -> popMode;
mode IN_STRING;
TEXT: ~[\\"]+ ;
BACKSLASH_PAREN: '\\(' -> pushMode(DEFAULT_MODE);
ESCAPE_SEQUENCE: '\\' . ;
DQUOTE_IN_STRING: '"' -> type(DQUOTE), popMode;
Parser:
parser grammar StringParser;
options {
tokenVocab = 'StringLexer';
}
start: exp EOF ;
exp : '(' exp ')'
| IDENTIFIER
| DQUOTE stringContents* DQUOTE
;
stringContents : TEXT
| ESCAPE_SEQUENCE
| '\\(' exp ')'
;
Here we push the default mode every time we see a (
or \(
and pop the mode every time we see a )
. This way it will go back inside the string only if the mode on top of the stack is the string mode, which would only be the case if there aren't any unclosed (
left since the last \(
.
This approach works, but has the downside that an unmatched )
will cause an empty stack exception rather than a normal syntax error because we're calling popMode
on an empty stack.
To avoid this, we can add a member that tracks how deeply nested we are inside parentheses and doesn't pop the stack when the nesting level is 0 (i.e. if the stack is empty):
@members {
int nesting = 0;
}
LPAR: '(' {
nesting++;
pushMode(DEFAULT_MODE);
};
RPAR: ')' {
if (nesting > 0) {
nesting--;
popMode();
}
};
mode IN_STRING;
BACKSLASH_PAREN: '\\(' {
nesting++;
pushMode(DEFAULT_MODE);
};
(The parts I left out are the same as in the previous version).
This works and produces normal syntax errors for unmatched )
s. However, it contains actions and is thus no longer language-agnostic, which is only a problem if you plan to use the grammar from multiple languages (and depending on the language, you might even be lucky and the code might be valid in all of your targeted languages).
If you want to avoid actions, the last alternative would be to have three modes: One for code that's outside of any strings, one for the inside of the string and one for the inside of \()
. The third one will be almost identical to the outer one, except that it will push and pop the mode when seeing parentheses, whereas the outer one will not. To make both modes produce the same types of tokens, the rules in the third mode will all call type()
. This will look like this:
lexer grammar StringLexer;
IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]* ;
DQUOTE: '"' -> pushMode(IN_STRING);
LPAR: '(';
RPAR: ')';
mode IN_STRING;
TEXT: ~[\\"]+ ;
BACKSLASH_PAREN: '\\(' -> pushMode(EMBEDDED);
ESCAPE_SEQUENCE: '\\' . ;
DQUOTE_IN_STRING: '"' -> type(DQUOTE), popMode;
mode EMBEDDED;
E_IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]* -> type(IDENTIFIER);
E_DQUOTE: '"' -> pushMode(IN_STRING), type(DQUOTE);
E_LPAR: '(' -> type(LPAR), pushMode(EMBEDDED);
E_RPAR: ')' -> type(RPAR), popMode;
Note that we now can no longer use string literals in the parser grammar because string literals can't be used when multiple lexer rules are defined using the same string literal. So now we have to use LPAR
instead of '('
in the parser and so on (we already had to do this for DQUOTE
for the same reason).
Since this version involves a lot of duplication (especially as the amount of tokens rises) and prevents the use of string literals in the parser grammar, I generally prefer the version with the actions.
The full code for all three alternatives can also be found on GitHub.
\()
, it isn't as simple as that. You'd either have to invoke the parser again after the\(
or repeat a lot of the parser's work in post-processing. – sepp2k