I'm trying to code a context-sensitive lexer rule using ANTLR but can't get it to do what I need. The rule needs to match one of two alternatives based on characters found at the beginning of the rule. Below is a greatly simplified version of the problem.

This example grammar:

lexer grammar X;

options
{
  language = C;
}

RULE :
  SimpleIdent {ctx->someFunction($SimpleIdent);}
  (
    {ctx->test != true}?
     //Nothing
  | {ctx->test == true}?
     SLSpace+ OtherText
  )
  ;

fragment SimpleIdent  : ('a'..'z' | 'A'..'Z' | '_')+;
fragment SLSpace    : ' ';
fragment OtherText :  (~'\n')* '\n';

I would expect the lexer to exit this rule if ctx->test is false, ignoring any characters after SimpleIdent. Unfortunately, ANTLR tests the character after SimpleIdent before it tests the predicate, and thus always takes the second alternative if there is a space there. This is clearly shown in the generated C code:

// X.g:10:3: ({...}?|{...}? ( SLSpace )+ OtherText )
{
    int alt2=2;
    switch ( LA(1) )
    {
    case '\t':
    case ' ':
        {
            alt2=2;
        }
        break;

    default:
        alt2=1;
    }

    switch (alt2)
    {
    case 1:
        // X.g:11:5: {...}?
        {
            if ( !((ctx->test != true)) )
            {
                    //Exception
            }

        }
        break;
    case 2:
        // X.g:13:5: {...}? ( SLSpace )+ OtherText
        {
            if ( !((ctx->test == true)) )
            {
                   //Exception
            }

How can I force ANTLR to take a specific path in the lexer at runtime?

1 Answer

Use a gated semantic predicate instead of a validating semantic predicate [1]. A validating predicate throws an exception if its expression evaluates to false, whereas a gated predicate ({...}?=>) switches the alternative off, so the lexer never tries it. Also, make the "nothing" alternative the last one to match.
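
To make the difference concrete, here is a small hand-written Java model of the two decision styles. It is only a sketch, not code that ANTLR generates: the class name PredicateModel, both method names, and the boolean parameter test (standing in for ctx->test) are made up for illustration. Alternative 1 is the empty alternative and alternative 2 is the "space, then text" alternative, as in the question's grammar.

```java
public class PredicateModel {

    // Validating predicate: the alternative is chosen from the lookahead
    // character alone, and the predicate is only checked afterwards.
    static int chooseValidating(char la, boolean test) {
        int alt = (la == ' ') ? 2 : 1;
        if (alt == 1 && test) {
            throw new IllegalStateException("predicate failed: test != true");
        }
        if (alt == 2 && !test) {
            throw new IllegalStateException("predicate failed: test == true");
        }
        return alt;
    }

    // Gated predicate: alternative 2 is only a candidate while the
    // predicate holds; otherwise the empty alternative 1 is taken.
    static int chooseGated(char la, boolean test) {
        if (test && la == ' ') {
            return 2;
        }
        return 1;
    }

    public static void main(String[] args) {
        System.out.println(chooseGated(' ', false));   // prints 1
        System.out.println(chooseGated(' ', true));    // prints 2
        try {
            chooseValidating(' ', false);               // space seen, test is false
        } catch (IllegalStateException e) {
            System.out.println("validating: " + e.getMessage());
        }
    }
}
```

With a space as lookahead and test set to false, the validating model fails while the gated model quietly falls back to the empty alternative, which is the behaviour the question asks for.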

Also, OtherText matches everything SLSpace matches, which makes SLSpace+ OtherText ambiguous. Simply remove SLSpace+ from it, or let OtherText start with something other than a ' '.

I'm not that familiar with the C target, but this Java demo should work just fine for C (after translating the Java code, of course):

grammar T;

rules
 : RULE+ EOF
 ;

RULE
 : SimpleIdent {boolean flag = $SimpleIdent.text.startsWith("a");}
   ( {!flag}?=> OtherText
   |            // Nothing
   )
 ;

Spaces 
 : (' ' | '\t' | '\r' | '\n')+ {skip();}
 ;

fragment SimpleIdent : ('a'..'z' | 'A'..'Z' | '_')+;
fragment OtherText   : (~'\n')* '\n';

If you now parse the input:

abcd efgh ijkl mnop
bbb aaa ccc ddd

you'll get the following parse:

[image: parse tree for the input above]

That is, a RULE token that starts with a lowercase "a" stops after the identifier, while any other RULE token extends all the way to the end of the line.
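
To see the resulting tokens without the picture, the behaviour of RULE and Spaces can be imitated by hand. The following Java sketch is not ANTLR output; the class RuleSketch and its methods are hypothetical names, and it assumes the input ends with a newline as in the example above.

```java
import java.util.ArrayList;
import java.util.List;

public class RuleSketch {

    // Same character set as the SimpleIdent fragment.
    static boolean isIdentChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
    }

    // Returns the text of every RULE token; whitespace between tokens is skipped.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == ' ' || c == '\t' || c == '\r' || c == '\n') {
                i++; // imitates the Spaces rule with skip()
                continue;
            }
            if (!isIdentChar(c)) {
                throw new IllegalArgumentException("no rule matches '" + c + "' at index " + i);
            }
            int start = i;
            while (i < input.length() && isIdentChar(input.charAt(i))) {
                i++; // SimpleIdent
            }
            boolean flag = input.charAt(start) == 'a';
            if (!flag) {
                // OtherText: everything up to and including the next '\n'.
                // Simplification: if no newline follows, consume to end of input.
                int nl = input.indexOf('\n', i);
                i = (nl < 0) ? input.length() : nl + 1;
            }
            tokens.add(input.substring(start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "abcd efgh ijkl mnop\nbbb aaa ccc ddd\n";
        for (String t : tokenize(input)) {
            System.out.println("RULE: '" + t.replace("\n", "\\n") + "'");
        }
    }
}
```

Under this reading of the grammar, the example input yields three RULE tokens: 'abcd', 'efgh ijkl mnop\n' and 'bbb aaa ccc ddd\n'.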

[1] What is a 'semantic predicate' in ANTLR?