Note: This is not an endorsement. There likely are many other parsing packages for Python which offer similar functionality. It's presented simply as a possible mechanism for achieving the (presumed) goal.
The Ply parsing package does contain a lexer generator, but you are under no obligation to use it; you're free to use whatever function you like to provide lexical tokens. Tokens are simply objects which have type and value attributes, and Ply makes no assumptions about the value object other than requiring it to exist.
Citing the manual here and also here:
When tokens are returned by lex, they have a value that is stored in the value attribute. Normally, the value is the text that was matched. However, the value can be assigned to any Python object.
…
If you are going to create a hand-written lexer and you plan to use it with yacc.py, it only needs to conform to the following requirements:
- It must provide a token() method that returns the next token or None if no more tokens are available.
- The token() method must return an object tok that has type and value attributes. If line number tracking is being used, then the token should also define a lineno attribute.
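For concreteness, a minimal hand-written lexer meeting those requirements might look like the sketch below. This is only an illustration of the interface, not anything Ply itself provides; the SimpleToken and ListLexer names, and the idea of feeding a pre-built list of tokens, are assumptions made for the example:

class SimpleToken(object):
    # Only .type and .value are required; .lineno matters only if
    # line-number tracking is enabled in the parser.
    def __init__(self, type, value, lineno=0):
        self.type = type
        self.value = value
        self.lineno = lineno

class ListLexer(object):
    def __init__(self, toks):
        self._it = iter(toks)

    def token(self):
        # Return the next token, or None once the input is exhausted.
        return next(self._it, None)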
In order to pass non-terminals into the parser, you'll need to make them look like terminals. The easiest way to do that will be to make a pseudo-terminal for each non-terminal, and convert the pseudo-terminal with a unit production. For example, assuming the pseudo-terminals are given names ending with an _
(which would make them pretty easy to generate programmatically), you could modify the rule
B : b | a B
into
B : b
  | a B
  | B_
If we assume that you inject the correct AST value into the token object returned with the B_ terminal, all you need to add is a single action function to your grammar:
def p_pseudo_terminals(p):
    '''A : A_
       B : B_
    '''
    p[0] = p[1]
Here's a complete runnable example, using the grammar in the question:
File: inject.py
from ply import yacc
from collections import namedtuple

Token = namedtuple('Token', ['type', 'value'])
Node = namedtuple('Node', ['type', 'children'])

tokens = ['a', 'b', 'A_', 'B_']

def p_S(p):
    '''S : A B'''
    p[0] = Node('S', [ p[1], p[2] ])

def p_A(p):
    '''A : a'''
    p[0] = Node('A', [ p[1] ])

def p_Al(p):
    '''A : a A'''
    p[0] = Node('A', [ p[1], p[2] ])

def p_B(p):
    '''B : b'''
    p[0] = Node('B', [ p[1] ])

def p_Bl(p):
    '''B : a B'''
    p[0] = Node('B', [ p[1], p[2] ])

def p_pseudo(p):
    '''A : A_
       B : B_
    '''
    p[0] = p[1]

class Lexer(object):
    def __init__(self, iter):
        self.iter = iter.__iter__()

    def token(self):
        try:
            retval = next(self.iter)
            if type(retval) == Token:
                # Input is a token, just pass it through
                return retval
            else:
                # Input is an AST node; fabricate a pseudo-token
                return Token(retval.type + '_', retval)
        except StopIteration:
            return None

parser = yacc.yacc()
And a sample run:
$ python3
Python 3.6.9 (default, Nov 7 2019, 10:44:02)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from inject import *
>>> token_stream = [Token('a', 'a')] * 3 + [Node('B', ['a', Node('B', children=['a', Node('B', children=['b'])])])]
>>> lexer = Lexer(token_stream)
>>> print(parser.parse(None, lexer=lexer))
Node(type='S', children=[Node(type='A', children=['a', Node(type='A', children=['a', Node(type='A', children=['a'])])]), Node(type='B', children=['a', Node(type='B', children=['a', Node(type='B', children=['b'])])])])
In a real grammar, it might be tedious to type all of those pseudo-token names and unit productions. You can take advantage of the fact that you can generate the docstring yourself, as long as it is in place before you build the parser by calling yacc.yacc. You'll also need to add the pseudo-token names to the list of token names so that Ply knows that B_ is a token. A rough sketch of how that could be done follows.
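For example, something along these lines should work; the nonterminals list, the hard-coded 'a' and 'b' token names, and the way the rules are assembled are assumptions for the sketch, not part of Ply's API. The only Ply-specific fact it relies on is that yacc.yacc reads each p_ function's docstring when it is called, so the docstring can be built programmatically beforehand:

# Non-terminals that may arrive already parsed (assumed for this sketch).
nonterminals = ['A', 'B']

# Extend the token list with the generated pseudo-terminal names.
tokens = ['a', 'b'] + [nt + '_' for nt in nonterminals]

def p_pseudo_terminals(p):
    p[0] = p[1]

# Build the unit productions "A : A_", "B : B_", ... and install them as
# the docstring before the parser is constructed.
p_pseudo_terminals.__doc__ = '\n'.join(
    '{nt} : {nt}_'.format(nt=nt) for nt in nonterminals)

parser = yacc.yacc()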