Note: This is not an endorsement. There likely are many other parsing packages for Python which offer similar functionality. It's presented simply as a possible mechanism for achieving the (presumed) goal.
The Ply parsing package does contain a lexer generator, but you are under no obligation to use it; you're free to use whatever function you like to provide lexical tokens. Tokens are simply objects which have type and value attributes, and Ply makes no assumptions about the value object other than requiring it to exist.
Citing the manual here and also here:
When tokens are returned by lex, they have a value that is stored in the value attribute. Normally, the value is the text that was matched. However, the value can be assigned to any Python object.
…
If you are going to create a hand-written lexer and you plan to use it with yacc.py, it only needs to conform to the following requirements:
- It must provide a token() method that returns the next token or None if no more tokens are available.
- The token() method must return an object tok that has type and value attributes. If line number tracking is being used, then the token should also define a lineno attribute.
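For concreteness, a minimal hand-written lexer meeting those requirements might look like the sketch below. This is only an illustration of the interface, not anything Ply itself provides; the SimpleToken and ListLexer names, and the idea of feeding a pre-built list of tokens, are assumptions made for the example:

class SimpleToken(object):
    # Only .type and .value are required; .lineno matters only if
    # line-number tracking is enabled in the parser.
    def __init__(self, type, value, lineno=0):
        self.type = type
        self.value = value
        self.lineno = lineno

class ListLexer(object):
    def __init__(self, toks):
        self._it = iter(toks)

    def token(self):
        # Return the next token, or None once the input is exhausted.
        return next(self._it, None)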
In order to pass non-terminals into the parser, you'll need to make them look like terminals. The easiest way to do that will be to make a pseudo-terminal for each non-terminal, and convert the pseudo-terminal with a unit production. For example, assuming the pseudo-terminals are given names ending with an _
(which would make them pretty easy to generate programmatically), you could modify the rule
B : b | a B
into
B : b
  | a B
  | B_
If we assume that you inject the correct AST value into the token object returned with the B_ terminal, all you need to add is a single action function to your grammar:
def p_pseudo_terminals(p):
    '''A : A_
       B : B_
    '''
    p[0] = p[1]
Here's a complete runnable example, using the grammar in the question:
File: inject.py
from ply import yacc
from collections import namedtuple

Token = namedtuple('Token', ['type', 'value'])
Node = namedtuple('Node', ['type', 'children'])

tokens = ['a', 'b', 'A_', 'B_']

def p_S(p):
    '''S : A B'''
    p[0] = Node('S', [ p[1], p[2] ])

def p_A(p):
    '''A : a'''
    p[0] = Node('A', [ p[1] ])

def p_Al(p):
    '''A : a A'''
    p[0] = Node('A', [ p[1], p[2] ])

def p_B(p):
    '''B : b'''
    p[0] = Node('B', [ p[1] ])

def p_Bl(p):
    '''B : a B'''
    p[0] = Node('B', [ p[1], p[2] ])

def p_pseudo(p):
    '''A : A_
       B : B_
    '''
    p[0] = p[1]

class Lexer(object):
    def __init__(self, iter):
        self.iter = iter.__iter__()

    def token(self):
        try:
            retval = next(self.iter)
            if type(retval) == Token:
                # Input is a token, just pass it through
                return retval
            else:
                # Input is an AST node; fabricate a pseudo-token
                return Token(retval.type + '_', retval)
        except StopIteration:
            return None

parser = yacc.yacc()
And a sample run:
$ python3
Python 3.6.9 (default, Nov 7 2019, 10:44:02)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from inject import *
>>> token_stream = [Token('a', 'a')] * 3 + [Node('B', ['a', Node('B', children=['a', Node('B', children=['b'])])])]
>>> lexer = Lexer(token_stream)
>>> print(parser.parse(None, lexer=lexer))
Node(type='S', children=[Node(type='A', children=['a', Node(type='A', children=['a', Node(type='A', children=['a'])])]), Node(type='B', children=['a', Node(type='B', children=['a', Node(type='B', children=['b'])])])])
In a real grammar, it might be tedious to type all of those pseudo-token names and unit productions. You can take advantage of the fact that you can generate the docstring yourself, as long as it is in place before you build the parser by calling yacc.yacc. You'll also need to add the pseudo-token names to the list of token names so that Ply knows that B_ is a token. A rough sketch of how that could be done follows.
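For example, something along these lines should work; the nonterminals list, the hard-coded 'a' and 'b' token names, and the way the rules are assembled are assumptions for the sketch, not part of Ply's API. The only Ply-specific fact it relies on is that yacc.yacc reads each p_ function's docstring when it is called, so the docstring can be built programmatically beforehand:

# Non-terminals that may arrive already parsed (assumed for this sketch).
nonterminals = ['A', 'B']

# Extend the token list with the generated pseudo-terminal names.
tokens = ['a', 'b'] + [nt + '_' for nt in nonterminals]

def p_pseudo_terminals(p):
    p[0] = p[1]

# Build the unit productions "A : A_", "B : B_", ... and install them as
# the docstring before the parser is constructed.
p_pseudo_terminals.__doc__ = '\n'.join(
    '{nt} : {nt}_'.format(nt=nt) for nt in nonterminals)

parser = yacc.yacc()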