I'm using ANTLR4 to generate a lexer for a JavaScript preprocessor (basically, it tokenizes a JavaScript file and extracts every string literal).
I started from a grammar originally written for ANTLR3 and ported the relevant parts (only the lexer rules) to v4.
Only one issue remains: I don't know how to handle corner cases for regex literals, like this:
log(Math.round(v * 100) / 100 + ' msec/sample');
The / 100 + ' msec/
is interpreted as a RegEx literal, because the lexer rule is always active.
What I would like is to incorporate the following logic (this is C# code; I need JavaScript, but I don't know how to adapt it):
/// <summary>
/// Indicates whether regular expression recognition (true) or division expression recognition (false) is enabled in the lexer.
/// These are mutually exclusive, and the decision about which is active in the lexer is based on the previous on-channel token.
/// When the previous token can be identified as a possible left operand for a division, this results in false; otherwise true.
/// </summary>
private bool AreRegularExpressionsEnabled
{
    get
    {
        if (Last == null)
        {
            return true;
        }
        switch (Last.Type)
        {
            // identifier
            case Identifier:
            // literals
            case NULL:
            case TRUE:
            case FALSE:
            case THIS:
            case OctalIntegerLiteral:
            case DecimalLiteral:
            case HexIntegerLiteral:
            case StringLiteral:
            // member access ending
            case RBRACK:
            // function call or nested expression ending
            case RPAREN:
                return false;
            // otherwise OK
            default:
                return true;
        }
    }
}
This rule was present in the old grammar as an inline predicate, like this:
RegularExpressionLiteral
: { AreRegularExpressionsEnabled }?=> DIV RegularExpressionFirstChar RegularExpressionChar* DIV IdentifierPart*
;
But I don't know how to use this technique in ANTLR4.
In the ANTLR4 book, there are some suggestions for solving this kind of problem at the parser level (section 12.2, context-sensitive lexical problems), but I don't want to use a parser. I just want to extract all the tokens, leave everything untouched except for the string literals, and keep parsing out of my way.
Any suggestion would be really appreciated, thanks!
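For reference, the decision itself can be sketched in plain JavaScript as below. This is only a minimal sketch under assumptions: the token type constants in T are hypothetical stand-ins with made-up numeric values (a generated lexer would supply the real ones), and the last argument is assumed to be the previous on-channel token, exposing its type as a type property. It does not show how to hook the check into the ANTLR4-generated lexer, which is the part I'm missing.

```javascript
// Hypothetical token type constants; in a real setup these would come from
// the generated lexer. The numeric values are made up for illustration.
const T = Object.freeze({
  Identifier: 1, NULL: 2, TRUE: 3, FALSE: 4, THIS: 5,
  OctalIntegerLiteral: 6, DecimalLiteral: 7, HexIntegerLiteral: 8,
  StringLiteral: 9, RBRACK: 10, RPAREN: 11, LPAREN: 12, ASSIGN: 13,
});

// Token types that can end the left operand of a division.
const DIVISION_OPERAND_ENDINGS = new Set([
  T.Identifier,                                   // identifier
  T.NULL, T.TRUE, T.FALSE, T.THIS,                // keyword literals
  T.OctalIntegerLiteral, T.DecimalLiteral,
  T.HexIntegerLiteral, T.StringLiteral,           // numeric and string literals
  T.RBRACK,                                       // member access ending
  T.RPAREN,                                       // call or nested expression ending
]);

// Returns true when a "/" should start a regex literal, false when it
// should be treated as a division operator. `last` is the previous
// on-channel token, or null/undefined at the start of the input.
function areRegularExpressionsEnabled(last) {
  if (last == null) {
    return true;
  }
  return !DIVISION_OPERAND_ENDINGS.has(last.type);
}
```

In the failing example, the token before the first / is the closing parenthesis of Math.round(...), so the function returns false and the slash would be treated as a division rather than the start of a regex literal.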