1
votes

I am trying to implement an interpreter for a programming language, and ended up stumbling upon a case where I would need to backtrack, but my parser generator (ply, a lex&yacc clone written in Python) does not allow that Here's the rules involved:

'var_access_start : super'
'var_access_start : NAME'
'var_access_name : DOT NAME'
'var_access_idx : OPSQR expression CLSQR'
'''callargs : callargs COMMA expression
            | expression
            | '''
'var_access_metcall : DOT NAME LPAREN callargs RPAREN'
'''var_access_token : var_access_name
                    | var_access_idx
                    | var_access_metcall'''
'''var_access_tokens : var_access_tokens var_access_token
                     | var_access_token'''
'''fornew_var_access_tokens : var_access_tokens var_access_name
                            | var_access_tokens var_access_idx
                            | var_access_name
                            | var_access_idx'''
'type_varref : var_access_start fornew_var_access_tokens'
'hard_varref : var_access_start var_access_tokens'
'easy_varref : var_access_start'
'varref : easy_varref'
'varref : hard_varref'
'typereference : NAME'
'typereference : type_varref'
'''expression : new typereference LPAREN callargs RPAREN'''

'var_decl_empty : NAME'
'var_decl_value : NAME EQUALS expression'
'''var_decl : var_decl_empty
            | var_decl_value'''
'''var_decls : var_decls COMMA var_decl
             | var_decl'''
    'statement : var var_decls SEMIC'

The error occurs with statements of the form

var x = new SomeGuy.SomeOtherGuy();

where SomeGuy.SomeOtherGuy would be a valid variable that stores a type (types are first class objects) - and that type has a constructor with no arguments

What happens when parsing that expression is that the parser constructs a var_access_start = SomeGuy var_access_metcall = . SomeOtherGuy ( ) and then finds a semicolon and ends in an error state - I would clearly like the parser to backtrack, and try constructing an expression = new typereference(SomeGuy .SomeOtherGuy) LPAREN empty_list RPAREN and then things would work because the ; would match the var statement syntax all right

However, given that PLY does not support backtracking and I definitely do not have enough experience in parser generators to actually implement it myself - is there any change I can make to my grammar to work around the issue?

I have considered using -> instead of . as the "method call" operator, but I would rather not change the language just to appease the parser. Also, I have methods as a form of "variable reference" so you can do myObject.someMethod().aChildOfTheResult[0].doSomeOtherThing(1,2,3).helloWorld() but if the grammar can be reworked to achieve the same effect, that would also work for me

Thanks!

1

1 Answers

2
votes

I assume that your language includes expressions other than the ones you've included in the excerpt. I'm also going to assume that new, super and var are actually terminals.

The following is only a rough outline. For readability, I'm using bison syntax with quoted literals, but I don't think you'll have any trouble converting.

You say that "types are first-class values" but your syntax explicitly precludes using a method call to return a type. In fact, it also seems to preclude a method call returning a function, but that seems odd since it would imply that methods are not first-class values, even though types are. So I've simplified the grammar by allowing expressions like:

 new foo.returns_method_which_returns_type()()()

It's easy enough to add the restrictions back in, but it makes the exposition harder to follow.

The basic idea is that to avoid forcing the parser to make a premature decision; once new is encountered, it is only possible to distinguish between a method call and a constructor call from the lookahead token. So we need to make sure that the same reductions are used up to that point, which means that when the open parenthesis is encountered, we must still retain both possibilities.

primary: NAME
       | "super"
       ;

postfixed: primary
         | postfixed '.' NAME
         | postfixed '[' expression ']'
         | postfixed '(' call_args ')'            /* PRODUCTION 1 */
         ;

expression: postfixed
          | "new" postfixed '(' call_args ')'     /* PRODUCTION 2 */
       /* | other stuff not relevant here */
          ;

 /* Your callargs allows (,,,3). This one doesn't */
call_args : /* EMPTY */
          | expression_list
          ;
expression_list: expression
               | expression_list ',' expression
               ;

/* Another slightly simplified production */
var_decl: NAME
        | NAME '=' expression
        ;

var_decl_list: var_decl
             | var_decl_list ',' var_decl
             ;

statement: "var" var_decl_list ';'
      /* | other stuff not relevant here */
         ;

Now, take a look at PRODUCTION 1 and PRODUCTION 2, which are very similar. (Marked with comments.) These are basically the ambiguity for which you sought backtracking. However, in this grammar, there is no issue, since once a new has been encountered, the reduction of PRODUCTION 2 can only be performed when the lookahead token is , or ;, while PRODUCTION 1 can only be performed with lookahead tokens ., ( and [.

(Grammar tested with bison, just to make sure there are no conflicts.)