3
votes

Problem

I am trying to implement an error-tolerant parser using Python Lex-Yacc (PLY), but I have trouble using error recovery rules at the end of my input string.

How can I recover from an unexpected end of input?

Example

This example grammar produces strings of the form A END A END A END A END ...

Statement   : Expressions

Expressions : Expression Expressions
            | 

Expression  : A END

I want to perform error recovery if the END token is omitted, so that strings like A A A END or A A A will be recognized by the parser.

My approach

I added an error recovery rule to the Expression production:

Expression : A END
           | A error

This allows me to accept input like A A A END.

But if the last END token is omitted (A A A), I still get a syntax error and cannot recover.


Sample PLY code

from __future__ import print_function

# Tokens
tokens = ('A', 'END')

t_A   = r'A'
t_END = r'END'
t_ignore = " "

def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

# Build the lexer
import ply.lex as lex
lex.lex()

# Rules
def p_statement_expr(p):
    '''statement : expressions'''
    print("parsed:", p[1])

def p_expressions(p):
    '''expressions : expression expressions'''
    p[0] = [p[1]] + p[2]

def p_expressions_empty(p):
    '''expressions : '''
    p[0] = list()

def p_expression_pharse(p):
    '''expression : A END
                  | A error'''
    p[0] = 'A'

def p_error(p):
    if p:
        print("Syntax error at '%s'" % p.value)
    else:
        print("Syntax error at EOI")

import ply.yacc as yacc
yacc.yacc()

while 1:
    try:
        s = raw_input('query > ')   # use input() on Python 3
    except EOFError:
        break
    yacc.parse(s)
3
You say you want to be tolerant. But do you want to just emit a warning and consider every A as valid even if there is no END, keep only the first A and ignore the others until END, or keep only the last A and ignore the previous ones? Same at the end: do you want to reject the final A or process it as valid? – Serge Ballesta
I want to emit a warning and accept all As as valid, even if there is no END. – Jen-Ya

3 Answers

4
votes

I am adding this as a new answer (and I do know it is too late for the bounty :-( ) because it is a very different approach. If we used flex, it would be much easier, since it has the notion of the <<EOF>> token that matches only at end of file. After thinking about that, I realized that it is very simple to add that functionality to PLY without any change to the original module, by using a proxy around the lexer. And Python allows easy implementation of proxies thanks to the __getattr__ special method.

I just add

  • a new token EOF that will be sent at end of file
  • a proxy around the token method of the lexer that, at end of file, returns the special EOF token on the first call and then the normal None
  • the EOF token at the end of the statement rule

And, as in my other answer, I reverse the rule to expressions : expressions expression instead of expressions : expression expressions to allow an immediate reduce

The code becomes:

from __future__ import print_function

# Tokens
tokens = ('A', 'END', 'EOF')

t_A   = r'A'
t_END = r'END'
t_ignore = " "

def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

# Build the lexer
import ply.lex as lex

orig_lexer = lex.lex()

class ProxyLexer(object):
    def __init__(self, lexer, eoftoken):
        self.end = False
        self.lexer = lexer
        self.eof = eoftoken
    def token(self):
        tok = self.lexer.token()
        if tok is None:
            if self.end:
                # the EOF token has already been emitted: report the real end
                self.end = False
            else:
                # first end of input: emit the special EOF token once
                self.end = True
                tok = lex.LexToken()
                tok.type = self.eof
                tok.value = None
                tok.lexpos = self.lexer.lexpos
                tok.lineno = self.lexer.lineno
        # print('custom', tok)
        return tok
    def __getattr__(self, name):
        # delegate everything else (input, lexpos, lineno, ...) to the wrapped lexer
        return getattr(self.lexer, name)

lexer = ProxyLexer(orig_lexer, 'EOF')

# Rules
def p_statement_expr(p):
    '''statement : expressions EOF'''
    print("parsed:", p[1])

def p_expressions(p):
    '''expressions : expressions expression'''
    p[0] = p[1] + [p[2]]

def p_expressions_empty(p):
    '''expressions : '''
    p[0] = list()

def p_expression_pharse(p):
    '''expression : A END
                  | A error'''
    p[0] = 'A'

def p_error(p):
    if p:
        print("Syntax error at '%s'" % p.value)
    else:
        print("Syntax error at EOI")

import ply.yacc as yacc
parser = yacc.yacc()

while 1:
    try:
        s = raw_input('query > ')   # use input() on Python 3
    except EOFError:
        break
    parser.parse(s, lexer = lexer)

That way:

  • the original grammar is unchanged
  • the error recovery method remains stupidly simple and has no dependence on the rest of the grammar
  • it can be easily extended to complex parsers
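
For illustration, here is a quick check of the token stream the proxy emits (my addition, not part of the original answer). The proxy forwards input() to the wrapped lexer via __getattr__:

# feed the proxy directly and list the token types it produces
lexer.input("A A")
while True:
    tok = lexer.token()
    if tok is None:        # the real end of input, after EOF was emitted
        break
    print(tok.type)        # prints: A, A, EOF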
2
votes

As you want to accept all elements, you can explicitly declare a rule for an A not followed by an END, and use the fact that yacc and PLY deal gracefully with ambiguous rules.

You can simply have a normal rule:

Expression : A END

and, below it, a lower priority rule (lower because it comes later) that will issue a warning:

Expression : A

That way, all A tokens will be accepted, there won't be any syntax error, and the warning will be issued for any A not followed by an END, including one at the end of the input. To make it easier to find the offending A, I have added the position of the symbol in the input to the warning.

Edit:

The script is modified to correctly deal with other syntax errors (such as AENDENDAEND), and also to immediately reduce expressions by replacing expressions : expression expressions with expressions : expressions expression (left recursion lets the parser reduce as soon as each expression is complete).

Here is the modified script (tested in Python 3.4 by simply replacing raw_input with input):

from __future__ import print_function

# Tokens
tokens = ('A', 'END')

t_A   = r'A'
t_END = r'END'
t_ignore = " "

def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

# Build the lexer
import ply.lex as lex
lex.lex()

# Rules
def p_statement_expr(p):
    '''statement : expressions'''
    print("parsed:", p[1])

def p_expressions(p):
    '''expressions : expressions expression'''
    p[0] = p[1] + [p[2]]

def p_expressions_err(p):
    '''expressions : expressions error'''
    p[0] = p[1]

def p_expressions_empty(p):
    '''expressions : '''
    p[0] = list()

def p_expression_pharse(p):
    '''expression : A END'''
    p[0] = 'A'

# add a separate rule BELOW previous one to display a warning
def p_expression_pharse_warn(p):
    '''expression : A'''
    print("Warning at absolute position %d (line %d)" % (p.lexpos(1), p.lineno(1)))
    p[0] = 'A'

def p_error(p):
    if p:
        print("Syntax error at '%s'" % p.value)
    else:
        print("Syntax error at EOI")


import ply.yacc as yacc
yacc.yacc()

while 1:
    try:
        s = raw_input('query > ')   # use input() on Python 3
    except EOFError:
        break
    yacc.parse(s)
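
For instance, a session with this script should look roughly like this (my reconstruction, not part of the original answer):

query > A A A END
Warning at absolute position 0 (line 1)
Warning at absolute position 2 (line 1)
parsed: ['A', 'A', 'A']
query > A A A
Warning at absolute position 0 (line 1)
Warning at absolute position 2 (line 1)
Warning at absolute position 4 (line 1)
parsed: ['A', 'A', 'A']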

Edit: the following is an incorrect attempt to avoid an additional rule; it is more complex and less efficient than the above version. Please see my conclusion below.

Edit per comment:

I understand your point that you do not want to multiply grammar rules. It is possible to be fault tolerant, except for the last token. If your last token is in error, it will not be followed by anything and will never be caught by the rule expression : A error.

But here is a fault-tolerant parser that keeps everything except the last token in case of an error on that one:

from __future__ import print_function

# Tokens
tokens = ('A', 'END')

t_A   = r'A'
t_END = r'END'
t_ignore = " "

def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

# Build the lexer
import ply.lex as lex
lex.lex()

# Rules
def p_statement_expr(p):
    '''statement : expressions'''
    # print("parsed:", p[1])

def p_expressions(p):
    '''expressions : expressions expression'''
    p[0] = p[1] + [p[2]]
    result.append(p[2])

def p_expressions_empty(p):
    '''expressions : '''
    p[0] = list()

def p_expression_pharse(p):
    '''expression : A END
                  | A error'''
    p[0] = 'A'

def p_error(p):
    if p:
        print("Syntax error at '%s' (%d)" % (p.value, p.lexpos))
    else:
        print("Syntax error at EOI")

import ply.yacc as yacc
yacc.yacc()

while 1:
    try:
        s = raw_input('query > ')   # use input() on Python 3
    except EOFError:
        break
    result = []
    yacc.parse(s)
    print('Result', result)

The principle is to collect with expressions : expressions expression instead of expressions : expression expressions, and to keep everything in a global variable.

With an input of A END A A END A A A END it gives

Result ['A', 'A', 'A', 'A', 'A', 'A']

and with A END A A END A A A (the final END omitted), it gives

Result ['A', 'A', 'A', 'A', 'A']

(all tokens but the last)

With a true flex-bison solution, it would be possible to make use of the special <<EOF>> token that matches at end of input, to always have another token after the last one. Unfortunately, it is not implemented in PLY, and the only real solution is to introduce a rule that accepts a lone A token. For a real parser, it also guarantees that you are actually processing the correct token: I used

def p_expression_pharse(p):
    '''expression : A END'''
    p[0] = 1 + p.lexpos(1)

# add a separate rule BELOW previous one to display a warning
def p_expression_pharse_warn(p):
    '''expression : A'''
    print("Warning at absolute position %d (line %d)" % (p.lexpos(1), p.lineno(1)))
    p[0] = -1 - p.lexpos(1)

to uniquely identify tokens in the result string, and I get correct positions.
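
As an aside (my addition, not in the original answer), a small hypothetical helper shows how that encoding can be read back: a positive value marks an A END parsed at lexpos value - 1, a negative value marks a warned lone A at lexpos -value - 1.

def decode(v):
    # hypothetical helper, not part of the original answer
    if v > 0:
        return ('ok', v - 1)      # A END reduced at this lexpos
    return ('warn', -v - 1)       # lone A (warning issued) at this lexpos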

And ... the error processing is very simple ...

Discussion TL;DR:

I admit I missed the point of last-token error recovery. That is because in all the parsers I've seen in real use cases, error recovery consisted of rejecting the part that was syntactically incorrect (and thus not directly usable) and re-synchronizing the parser on the next correct group of tokens. In everything I have seen, if a partial sentence can be used, it must not be processed by the error recovery mechanism but by a grammar rule, in which it is easy to describe the appropriate action.

If you just want to keep the offending input for later processing, I think it is not a problem of an action depending on the syntax, and I would simply note the position of the offending token, or at most note the position of the last correctly analysed token (the end of a complete element) and the beginning of the first error-recovery token, and say that whatever lies between them is incorrect.

But that would be much different from what is asked here ...

1
vote

This works for all the examples I could imagine:

from __future__ import print_function

# Tokens
tokens = ('A', 'END')

t_A   = r'A'
t_END = r'END'
t_ignore = " "

def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

# Build the lexer
import ply.lex as lex
lex.lex()

# Rules
def p_statement_expr(p):
    '''statement : expressions'''
    print("parsed:", p[1])

def p_expressions(p):
    '''expressions : expression expressions'''
    p[0] = p[1] + p[2]

def p_expressions_empty(p):
    '''expressions : '''
    p[0] = list()

def p_expression_pharse(p):
    '''expression : A END'''
    p[0] = ['A']

def p_expression_error(p):
    '''expression : A error'''
    p[0] = ['A']
    if p[2] is not None:
        # p[2] carries the value attached to the error token in p_error
        # (None when the error was raised at end of input)
        p[0] += p[2]

def p_error(p):
    if p is None:
        # unexpected end of input: synthesize an error token so that the
        # rule 'expression : A error' can still be reduced
        print("Syntax error at EOI")
        e = yacc.YaccSymbol()
        e.type = 'error'
        e.value = None
        yacc.errok()    # leave error-recovery mode
        return e        # the returned symbol becomes the new lookahead
    elif p.type == 'error':
        # called again with the error symbol itself: just clear the error state
        yacc.errok()
        return
    elif hasattr(p, 'value'):
        # unexpected token in the middle of the input: replace it with an
        # error token that carries its value, so the input is not lost
        print("Syntax error at '%s'" % p.value)
        e = yacc.YaccSymbol()
        e.type = 'error'
        e.value = p.value
        yacc.errok()
        return e

import ply.yacc as yacc
yacc.yacc()

while 1:
    try:
        s = raw_input('query > ')   # use input() on Python 3
    except EOFError:
        break
    yacc.parse(s)
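
With this version, a session should look roughly like this (my reconstruction, not part of the original answer; the exact messages may depend on the PLY version):

query > A A A END
Syntax error at 'A'
parsed: ['A', 'A', 'A']
query > A A A
Syntax error at 'A'
Syntax error at EOI
parsed: ['A', 'A', 'A']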