3
votes

Problem

I am trying to implement an error-tolerant parser using Python Lex-Yacc (PLY), but I have trouble using error recovery rules at the end of my input string.

How can I recover from an unexpected end of input?

Example

This example grammar produces strings of the form A END A END A END A END ...

Statement   : Expressions

Expressions : Expression Expressions
            | 

Expression  : A END

I want to perform error recovery if the END token is omitted, so that strings like A A A END or A A A will be recognized by the parser.

My approach

I added an error recovery rule to the Expression production:

Expression : A END
           | A error

This allows me to accept input like A A A END.

But if the last END token is omitted (A A A), I still get a syntax error and cannot recover.


Sample PLY code

from __future__ import print_function

# Tokens
tokens = ('A', 'END')

t_A   = r'A'
t_END = r'END'
t_ignore = " "

def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

# Build the lexer
import ply.lex as lex
lex.lex()

# Rules
def p_statement_expr(p):
    '''statement : expressions'''
    print("parsed:", p[1])

def p_expressions(p):
    '''expressions : expression expressions'''
    p[0] = [p[1]] + p[2]

def p_expressions_empty(p):
    '''expressions : '''
    p[0] = list()

def p_expression_pharse(p):
    '''expression : A END
                  | A error'''
    p[0] = 'A'

def p_error(p):
    if p:
        print("Syntax error at '%s'" % p.value)
    else:
        print("Syntax error at EOI")

import ply.yacc as yacc
yacc.yacc()

while 1:
    try:
        s = raw_input('query > ')   # use input() on Python 3
    except EOFError:
        break
    yacc.parse(s)
3
You say you want to be tolerant. But do you want to just emit a warning and consider every A as valid even if there is no END, keep only the first A and ignore the others until END, or keep only the last A and ignore the previous ones? Same at the end: do you want to reject the final A or process it as valid? – Serge Ballesta
I want to emit a warning and accept all As as valid, even if there is no END. – Jen-Ya

3 Answers

4
votes

I am adding this as a new answer (and I do know it is too late for the bounty :-( ) because it is a very different approach. If we used flex, it would be much easier, since it has the notion of the <<EOF>> token that matches only at end of file. After thinking about that, I realized that it is very simple to add that functionality to PLY without any change to the original module, by using a proxy around the lexer. And Python allows easy implementation of proxies thanks to the __getattr__ special method.

I just add

  • a new token EOF that will be sent at end of file
  • a proxy around the token method of the lexer that, at end of file, returns the special EOF token on the first call and then the normal None
  • the EOF token at the end of the statement rule

And, as in my other answer, I reverse the rule to expressions : expressions expression instead of expressions : expression expressions to allow an immediate reduce

The code becomes:

from __future__ import print_function

# Tokens
tokens = ('A', 'END', 'EOF')

t_A   = r'A'
t_END = r'END'
t_ignore = " "

def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

# Build the lexer
import ply.lex as lex

orig_lexer = lex.lex()

class ProxyLexer(object):
    def __init__(self, lexer, eoftoken):
        self.end = False
        self.lexer = lexer
        self.eof = eoftoken
    def token(self):
        tok = self.lexer.token()
        if tok is None:
            if self.end:
                # the EOF token has already been emitted: report the real end
                self.end = False
            else:
                # first end of input: emit the special EOF token once
                self.end = True
                tok = lex.LexToken()
                tok.type = self.eof
                tok.value = None
                tok.lexpos = self.lexer.lexpos
                tok.lineno = self.lexer.lineno
        # print('custom', tok)
        return tok
    def __getattr__(self, name):
        # delegate everything else (input, lexpos, lineno, ...) to the wrapped lexer
        return getattr(self.lexer, name)

lexer = ProxyLexer(orig_lexer, 'EOF')

# Rules
def p_statement_expr(p):
    '''statement : expressions EOF'''
    print("parsed:", p[1])

def p_expressions(p):
    '''expressions : expressions expression'''
    p[0] = p[1] + [p[2]]

def p_expressions_empty(p):
    '''expressions : '''
    p[0] = list()

def p_expression_pharse(p):
    '''expression : A END
                  | A error'''
    p[0] = 'A'

def p_error(p):
    if p:
        print("Syntax error at '%s'" % p.value)
    else:
        print("Syntax error at EOI")

import ply.yacc as yacc
parser = yacc.yacc()

while 1:
    try:
        s = raw_input('query > ')   # use input() on Python 3
    except EOFError:
        break
    parser.parse(s, lexer = lexer)

That way:

  • the original grammar is unchanged
  • the error recovery method remains stupidly simple and has no dependence on the rest of the grammar
  • it can be easily extended to complex parsers
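
For illustration, here is a quick check of the token stream the proxy emits (my addition, not part of the original answer). The proxy forwards input() to the wrapped lexer via __getattr__:

# feed the proxy directly and list the token types it produces
lexer.input("A A")
while True:
    tok = lexer.token()
    if tok is None:        # the real end of input, after EOF was emitted
        break
    print(tok.type)        # prints: A, A, EOF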
2
votes

As you want to accept all elements, you can explicitly declare a rule for an A not followed by an END, and use the fact that yacc and PLY deal gracefully with ambiguous rules.

You can simply have a normal rule:

Expression : A END

and, below it, a lower priority rule (lower because it comes later) that will issue a warning:

Expression : A

That way, all A tokens will be accepted, there won't be any syntax error, and the warning will be issued for any A not followed by an END, including one at the end of the input. To make it easier to find the offending A, I have added the position of the symbol in the input to the warning.

Edit:

The script is modified to correctly deal with other syntax errors (such as AENDENDAEND), and also to immediately reduce expressions by replacing expressions : expression expressions with expressions : expressions expression (left recursion lets the parser reduce as soon as each expression is complete).

Here is the modified script (tested in Python 3.4 by simply replacing raw_input with input):

from __future__ import print_function

# Tokens
tokens = ('A', 'END')

t_A   = r'A'
t_END = r'END'
t_ignore = " "

def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

# Build the lexer
import ply.lex as lex
lex.lex()

# Rules
def p_statement_expr(p):
    '''statement : expressions'''
    print("parsed:", p[1])

def p_expressions(p):
    '''expressions : expressions expression'''
    p[0] = p[1] + [p[2]]

def p_expressions_err(p):
    '''expressions : expressions error'''
    p[0] = p[1]

def p_expressions_empty(p):
    '''expressions : '''
    p[0] = list()

def p_expression_pharse(p):
    '''expression : A END'''
    p[0] = 'A'

# add a separate rule BELOW previous one to display a warning
def p_expression_pharse_warn(p):
    '''expression : A'''
    print("Warning at absolute position %d (line %d)" % (p.lexpos(1), p.lineno(1)))
    p[0] = 'A'

def p_error(p):
    if p:
        print("Syntax error at '%s'" % p.value)
    else:
        print("Syntax error at EOI")


import ply.yacc as yacc
yacc.yacc()

while 1:
    try:
        s = raw_input('query > ')   # use input() on Python 3
    except EOFError:
        break
    yacc.parse(s)
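
For instance, a session with this script should look roughly like this (my reconstruction, not part of the original answer):

query > A A A END
Warning at absolute position 0 (line 1)
Warning at absolute position 2 (line 1)
parsed: ['A', 'A', 'A']
query > A A A
Warning at absolute position 0 (line 1)
Warning at absolute position 2 (line 1)
Warning at absolute position 4 (line 1)
parsed: ['A', 'A', 'A']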

Edit: the following is an incorrect attempt to avoid an additional rule; it is more complex and less efficient than the above version. Please see my conclusion below.

Edit per comment:

I understand your point that you do not want to multiply grammar rules. It is possible to be fault tolerant, except for the last token. If your last token is in error, it will not be followed by anything and will never be caught by the rule expression : A error.

But here is a fault-tolerant parser that keeps everything except the last token in case of an error on that one:

from __future__ import print_function

# Tokens
tokens = ('A', 'END')

t_A   = r'A'
t_END = r'END'
t_ignore = " "

def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

# Build the lexer
import ply.lex as lex
lex.lex()

# Rules
def p_statement_expr(p):
    '''statement : expressions'''
    # print("parsed:", p[1])

def p_expressions(p):
    '''expressions : expressions expression'''
    p[0] = p[1] + [p[2]]
    result.append(p[2])

def p_expressions_empty(p):
    '''expressions : '''
    p[0] = list()

def p_expression_pharse(p):
    '''expression : A END
                  | A error'''
    p[0] = 'A'

def p_error(p):
    if p:
        print("Syntax error at '%s' (%d)" % (p.value, p.lexpos))
    else:
        print("Syntax error at EOI")

import ply.yacc as yacc
yacc.yacc()

while 1:
    try:
        s = raw_input('query > ')   # use input() on Python 3
    except EOFError:
        break
    result = []
    yacc.parse(s)
    print('Result', result)

The principle is to collect with expressions : expressions expression instead of expressions : expression expressions, and to keep everything in a global variable.

With an input of A END A A END A A A END it gives

Result ['A', 'A', 'A', 'A', 'A', 'A']

and with A END A A END A A A (the final END omitted), it gives

Result ['A', 'A', 'A', 'A', 'A']

(all tokens but the last)

With a true flex-bison solution, it would be possible to make use of the special <<EOF>> token that matches at end of input, to always have another token after the last one. Unfortunately, it is not implemented in PLY, and the only real solution is to introduce a rule that accepts a lone A token. For a real parser, it also guarantees that you are actually processing the correct token: I used

def p_expression_pharse(p):
    '''expression : A END'''
    p[0] = 1 + p.lexpos(1)

# add a separate rule BELOW previous one to display a warning
def p_expression_pharse_warn(p):
    '''expression : A'''
    print("Warning at absolute position %d (line %d)" % (p.lexpos(1), p.lineno(1)))
    p[0] = -1 - p.lexpos(1)

to uniquely identify tokens in the result string, and I get correct positions.
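
As an aside (my addition, not in the original answer), a small hypothetical helper shows how that encoding can be read back: a positive value marks an A END parsed at lexpos value - 1, a negative value marks a warned lone A at lexpos -value - 1.

def decode(v):
    # hypothetical helper, not part of the original answer
    if v > 0:
        return ('ok', v - 1)      # A END reduced at this lexpos
    return ('warn', -v - 1)       # lone A (warning issued) at this lexpos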

And ... the error processing is very simple ...

Discussion TL;DR:

I admit I missed the point of last-token error recovery. That is because in all the parsers I've seen in real use cases, error recovery consisted of rejecting the part that was syntactically incorrect (and thus not directly usable) and re-synchronizing the parser on the next correct group of tokens. In everything I have seen, if a partial sentence can be used, it must not be processed by the error recovery mechanism but by a grammar rule, in which it is easy to describe the appropriate action.

If you just want to keep the offending input for later processing, I think it is not a problem of an action depending on the syntax, and I would simply note the position of the offending token, or at most note the position of the last correctly analysed token (the end of a complete element) and the beginning of the first error-recovery token, and say that whatever lies between them is incorrect.

But that would be much different from what is asked here ...

1
vote

This works for all the examples I could imagine:

from __future__ import print_function

# Tokens
tokens = ('A', 'END')

t_A   = r'A'
t_END = r'END'
t_ignore = " "

def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

# Build the lexer
import ply.lex as lex
lex.lex()

# Rules
def p_statement_expr(p):
    '''statement : expressions'''
    print("parsed:", p[1])

def p_expressions(p):
    '''expressions : expression expressions'''
    p[0] = p[1] + p[2]

def p_expressions_empty(p):
    '''expressions : '''
    p[0] = list()

def p_expression_pharse(p):
    '''expression : A END'''
    p[0] = ['A']

def p_expression_error(p):
    '''expression : A error'''
    p[0] = ['A']
    if p[2] is not None:
        # p[2] carries the value attached to the error token in p_error
        # (None when the error was raised at end of input)
        p[0] += p[2]

def p_error(p):
    if p is None:
        # unexpected end of input: synthesize an error token so that the
        # rule 'expression : A error' can still be reduced
        print("Syntax error at EOI")
        e = yacc.YaccSymbol()
        e.type = 'error'
        e.value = None
        yacc.errok()    # leave error-recovery mode
        return e        # the returned symbol becomes the new lookahead
    elif p.type == 'error':
        # called again with the error symbol itself: just clear the error state
        yacc.errok()
        return
    elif hasattr(p, 'value'):
        # unexpected token in the middle of the input: replace it with an
        # error token that carries its value, so the input is not lost
        print("Syntax error at '%s'" % p.value)
        e = yacc.YaccSymbol()
        e.type = 'error'
        e.value = p.value
        yacc.errok()
        return e

import ply.yacc as yacc
yacc.yacc()

while 1:
    try:
        s = raw_input('query > ')   # use input() on Python 3
    except EOFError:
        break
    yacc.parse(s)
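
With this version, a session should look roughly like this (my reconstruction, not part of the original answer; the exact messages may depend on the PLY version):

query > A A A END
Syntax error at 'A'
parsed: ['A', 'A', 'A']
query > A A A
Syntax error at 'A'
Syntax error at EOI
parsed: ['A', 'A', 'A']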