
I'm using ANTLR4 to generate a lexer for a JavaScript preprocessor (basically, it tokenizes a JavaScript file and extracts every string literal).

I used a grammar originally written for ANTLR3 and ported the relevant parts (only the lexer rules) to v4.

I have just one issue remaining: I don't know how to handle corner cases for RegEx literals, like this:

log(Math.round(v * 100) / 100 + ' msec/sample');

The / 100 + ' msec/ part is interpreted as a RegEx literal, because the lexer rule is always active.
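The underlying ambiguity is that whether a / starts a division or a RegEx literal depends only on what precedes it; an illustrative JavaScript snippet (variable names are made up):

    var a = total / 100 + ' msec/sample'; // division: '/' is preceded by an identifier
    var b = /msec/.test(a);               // RegEx literal: '/' is preceded by '='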

What I would like is to incorporate this logic (C# code; I would need JavaScript, but I simply don't know how to adapt it):

    /// <summary>
    /// Indicates whether regular expression recognition (yields true) or division expression recognition (false) is enabled in the lexer.
    /// These are mutually exclusive, and the decision about which is active in the lexer is based on the previous on-channel token.
    /// When the previous token can be identified as a possible left operand for a division, this yields false; otherwise true.
    /// </summary>
    private bool AreRegularExpressionsEnabled
    {
        get
        {
            if (Last == null)
            {
                return true;
            }

            switch (Last.Type)
            {
                // identifier
                case Identifier:
                // literals
                case NULL:
                case TRUE:
                case FALSE:
                case THIS:
                case OctalIntegerLiteral:
                case DecimalLiteral:
                case HexIntegerLiteral:
                case StringLiteral:
                // member access ending 
                case RBRACK:
                // function call or nested expression ending
                case RPAREN:
                    return false;

                // otherwise OK
                default:
                    return true;
            }
        }
    }

This rule was present in the old grammar as an inline predicate, like this:

RegularExpressionLiteral
    : { AreRegularExpressionsEnabled }?=> DIV RegularExpressionFirstChar RegularExpressionChar* DIV IdentifierPart*
    ;

But I don't know how to use this technique in ANTLR4.

In the ANTLR4 book, there are some suggestions about solving this kind of problem at the parser level (chapter 12.2, context-sensitive lexical problems), but I don't want to use a parser. I just want to extract all the tokens, leave everything untouched except for the string literals, and keep parsing out of my way.

Any suggestion would be really appreciated, thanks!

This obviously is a problem you cannot solve by lexing alone. Lexing only gives you token values for certain input; it doesn't have any information about how to handle that RegEx input. If the meaning of a specific input sequence changes depending on some context, then you can handle that only on the parser side, or manually by adding a semantic phase after lexing. – Mike Lischke
While your comment is true when referring to the abstract task of lexing, in ANTLR3 you could attach small bits of logic to a lexer grammar, just as much as needed to solve my problem. I didn't need a parser in v3. Do I need it now in v4? – A. Chiesa
You can still use predicates in ANTLR4, but the syntax is different. Also, put the predicate at the end of the rule for performance reasons (or better yet, just after the first / delimiter char). – Lucas Trzesniewski

1 Answer


I'm posting here the final solution, developed by adapting the existing one to the new ANTLR4 syntax and addressing the differences of the JavaScript target.

I'm posting just the relevant parts, to give others a clue about a working strategy.

The rule was edited as follows:

RegularExpressionLiteral
    : DIV {this.isRegExEnabled()}? RegularExpressionFirstChar RegularExpressionChar* DIV IdentifierPart*
    ;
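Note that, as suggested in the comments, the predicate sits right after the first DIV instead of at the start of the rule, so it is evaluated only once a / has actually been seen.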

The isRegExEnabled function is defined in a @members section at the top of the lexer grammar, as follows:

@members {
// Override nextToken to keep track of the last token emitted on the
// default channel, i.e. skipping whitespace and comments.
EcmaScriptLexer.prototype.nextToken = function() {
  var result = antlr4.Lexer.prototype.nextToken.call(this);
  if (result.channel !== antlr4.Lexer.HIDDEN) {
    this._Last = result;
  }

  return result;
}

// A RegEx literal may start only where the previous significant token
// cannot be the left operand of a division.
EcmaScriptLexer.prototype.isRegExEnabled = function() {
  var la = this._Last ? this._Last.type : null;
  return la !== EcmaScriptLexer.Identifier &&
    la !== EcmaScriptLexer.NULL &&
    la !== EcmaScriptLexer.TRUE &&
    la !== EcmaScriptLexer.FALSE &&
    la !== EcmaScriptLexer.THIS &&
    la !== EcmaScriptLexer.OctalIntegerLiteral &&
    la !== EcmaScriptLexer.DecimalLiteral &&
    la !== EcmaScriptLexer.HexIntegerLiteral &&
    la !== EcmaScriptLexer.StringLiteral &&
    la !== EcmaScriptLexer.RBRACK &&
    la !== EcmaScriptLexer.RPAREN;
}}
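For completeness, here is a minimal driver sketch showing how the string literals can then be extracted without any parser; the require paths and the StringLiteral token name are assumptions based on the usual antlr4 JavaScript runtime layout and a typical ECMAScript lexer grammar:

    var antlr4 = require('antlr4'); // assumed: the antlr4 runtime package
    var EcmaScriptLexer = require('./EcmaScriptLexer').EcmaScriptLexer; // assumed path to the generated lexer

    var input = "log(Math.round(v * 100) / 100 + ' msec/sample');";
    var lexer = new EcmaScriptLexer(new antlr4.InputStream(input));

    // getAllTokens() calls nextToken() in a loop, so the override above is
    // exercised and the predicate always sees the last significant token.
    lexer.getAllTokens().forEach(function(token) {
      if (token.type === EcmaScriptLexer.StringLiteral) {
        console.log(token.text); // -> ' msec/sample'
      }
    });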

As you can see, two functions are defined. The first overrides the lexer's nextToken method: it wraps the existing implementation and saves the last non-comment, non-whitespace token for later reference. The semantic predicate then invokes isRegExEnabled, which checks whether that last significant token is compatible with the start of a RegEx literal; if it is not, it returns false.

Thanks to Lucas Trzesniewski for the comment that pointed me in the right direction, and to Patrick Hulsmeijer for the original work on v3.