
I cannot get JavaCC to properly disambiguate tokens by their place in a grammar. I have the following JJTree file (I'll call it bug.jjt):

options
{
  LOOKAHEAD = 3;
  CHOICE_AMBIGUITY_CHECK = 2;
  OTHER_AMBIGUITY_CHECK = 1;
  SANITY_CHECK = true;
  FORCE_LA_CHECK = true;
}
PARSER_BEGIN(MyParser)
import java.util.*;
public class MyParser {
    public static void main(String[] args) throws ParseException {
        MyParser parser = new MyParser(new java.io.StringReader(args[0]));
        SimpleNode root = parser.production();
        root.dump("");
    }
}
PARSER_END(MyParser)

SKIP:
{
    " "
}

TOKEN:
{
     <STATE:  ("state")>
     |<PROD_NAME: (["a"-"z"])+ >
}


SimpleNode production():
{}
{
    (
        <PROD_NAME>
        <STATE>
        <EOF>
    )
    {return jjtThis;}
}

Generate the parser code with the following:

java -cp C:\path\to\javacc.jar jjtree bug.jjt
java -cp C:\path\to\javacc.jar javacc bug.jj

Now, after compiling this, you can run MyParser from the command line with a string to parse as the argument. It prints production if successful and spews an error if it fails.
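
For example, with a standard JDK and all the generated .java files in the current directory, the compile and run steps would look something like this:

javac *.java
java MyParser "foo state"
java MyParser "state state"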

I tried two simple inputs: foo state and state state. The first one parses, but the second one does not, since both state strings are tokenized as <STATE>. As I set LOOKAHEAD to 3, I expected the parser to use the grammar and see that one state string must be a <STATE> and the other must be a <PROD_NAME>. However, no such luck. I have tried changing the various lookahead parameters to no avail. I am also not able to use tokenizer states (where you define different tokens allowable in different states), as this example is part of a more complicated system that will probably have a lot of these types of ambiguities.

Can anyone tell me how to make JavaCC properly disambiguate these tokens, without using tokenizer states?

3 Answers

2 votes

This is covered in the FAQ under question 4.19.

There are three strategies outlined there:

Putting choices in the grammar. See Bart Kiers's answer.

Using semantic lookahead. For this approach you get rid of the token production defining STATE and write your grammar like this:

SimpleNode production():
{}
{
    (
        <PROD_NAME>
        ( LOOKAHEAD({getToken(1).kind == PROD_NAME && getToken(1).image.equals("state")})
         <PROD_NAME>
         ...
        |
         ...other choices...
        )
    )
    {return jjtThis;}
}

If there are no other choices, then

SimpleNode production():
{}
{
    (
        <PROD_NAME>
        ( LOOKAHEAD({getToken(1).kind == PROD_NAME && getToken(1).image.equals("state")})
         <PROD_NAME>
         ...
        |
         { int[][] expTokSeqs = new int[][] { new int[] { STATE } };
           throw new ParseException(token, expTokSeqs, tokenImage); }
        )
    )
    {return jjtThis;}
}

But, in this case, you still need a token definition for STATE, since the STATE constant is referenced in the initialization of expTokSeqs. So you need a definition like

< DUMMY > TOKEN : { < STATE : "state" > }

where DUMMY is a lexical state that is never entered.

Using lexical states. The title of the OP's question suggests he doesn't want to do this, but not why. It can be done if the state switching can be contained in the token manager. Suppose a file is a sequence of productions, and each production looks like this:

name state : "a" | "b" name ;

That is, it starts with a name, then the keyword "state", a colon, some tokens, and finally a semicolon. (I'm just making this up, as I have no idea what sort of language the OP is trying to parse.) Then you can use three lexical states: DEFAULT, S0, and S1.

  • In DEFAULT, any sequence of letters (including "state") is a PROD_NAME. Recognizing a PROD_NAME switches the state to S0.
  • In S0, any sequence of letters except "state" is a PROD_NAME, and "state" is a STATE. Recognizing a STATE token switches the state to S1.
  • In S1, any sequence of letters (including "state") is a PROD_NAME. Recognizing a SEMICOLON switches the state back to DEFAULT.

So our example is tokenized like this:

name  state  : "a" | "b" name ;
|__|  |___|  |________________| |_______
DEFAULT  S0           S1          DEFAULT

The token productions are written like this:

<*> SKIP: { " " }

<S0> TOKEN: { <STATE: "state"> : S1 }

<DEFAULT> TOKEN:{ <PROD_NAME: (["a"-"z"])+ >  : S0 }
<S0,S1> TOKEN:{ <PROD_NAME: (["a"-"z"])+ >  }

<S1> TOKEN: { <SEMICOLON : ";" > : DEFAULT }
<S0, DEFAULT> TOKEN : { <SEMICOLON : ";" > }

<*> TOKEN : {
     < COLON : ":" >
|    ...etc...
}

It is possible for the parser to send state-switching commands back to the tokenizer, but it is tricky to get right and fragile. See question 3.12 of the FAQ.
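
For the record, such a command is just a parser action that calls the generated token manager's SwitchTo method. A minimal sketch, assuming a lexical state S0 in which "state" is defined as a STATE token and with no automatic ": S0" switches in the token section (the fragility being that a lookahead token may already have been fetched under the old state):

SimpleNode production() :
{}
{
    <PROD_NAME>
    // Parser action: switch the token manager to S0 before the next token
    // is read. Fragile: any token already fetched for lookahead was
    // tokenized under the old lexical state.
    { token_source.SwitchTo(S0); }
    <STATE>
    { token_source.SwitchTo(DEFAULT); }
    <EOF>
    { return jjtThis; }
}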

1 vote

Lookahead does not concern the lexer while it assembles characters into tokens. It is used by the parser when it matches non-terminals composed from terminals (tokens).

If you define "state" to result in a token STATE, well, then that's what it is.

I agree with you that tokenizer states aren't a good solution for permitting keywords to be used as identifiers. Is this really necessary? There's a good reason for HLLs not to permit this.

OTOH, if you can rewrite your grammar using just <PROD_NAME>s, you might postpone the recognition of the keywords until semantic analysis.
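
A sketch of that idea (a hypothetical production: the second identifier is accepted as a <PROD_NAME> by the tokenizer, and the keyword is checked in a parser action instead):

SimpleNode production() :
{ Token t; }
{
    <PROD_NAME>
    t = <PROD_NAME>
    {
        // The keyword is recognized here, not in the tokenizer.
        if (!t.image.equals("state"))
            throw new ParseException("expected 'state' but found '" + t.image + "'");
    }
    <EOF>
    { return jjtThis; }
}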

1 vote

The LOOKAHEAD option only applies to the parser (production rules). The tokenizer is not affected by it: it produces tokens without worrying about what a production rule is trying to match. Because both STATE and PROD_NAME match the string "state" and STATE is declared first, the input "state" will always be tokenized as a STATE, even when the parser is trying to match a PROD_NAME.

You could do something like this (untested, pseudo-ish grammar code ahead!):

SimpleNode production():
{}
{
    (
        prod_name_or_state()
        <STATE>
        <EOF>
    )
    {return jjtThis;}
}

SimpleNode prod_name_or_state():
{}
{
    (   
        <PROD_NAME>
    |   <STATE>
    )
    {return jjtThis;}
}

which would match both "foo state" and "state state".

Or the equivalent, but more compact:

SimpleNode production():
{}
{
    (
        ( <PROD_NAME> | <STATE> )
        <STATE>
        <EOF>
    )
    {return jjtThis;}
}