2
votes

I'm trying to write PyParsing code capable of parsing any Python code (I know that the AST module exists, but that will just be a starting point - I ultimately want to parse more than just Python code.)

Anyways, I figure I'll just start by writing something able to parse the classic

print("Hello World!")

So here's what I wrote:

from pyparsing import (alphanums, alphas, delimitedList, Forward,
                       quotedString, removeQuotes, Suppress, Word)

expr = Forward()
string = quotedString.setParseAction(removeQuotes)
call = expr + Suppress('(') + Optional(delimitedList(expr)) + Suppress(')')
name = World(alphas + '_', alphanums + '_')
expr <<= string | name | call

test = 'print("Hello World!")'
print(expr.parseString(test))

When I do that, though, it just spits out:

['print']

Which is technically a valid expr - you can type that into the REPL and there's no problem parsing it, even if it's useless.

So I thought maybe what I would want is to flip around name and call in my expr definition, so it would prefer returning calls to names, like this:

expr <<= string | call | name

Now I get a maximum recursion depth exceeded error. That makes sense, too:

  1. Checks if it's an expr.
    1. Checks if it's a string, it's not.
    2. Checks if it's a call.
      1. It must start with an expr, return to start of outer list.

So my question is... how can I define call and expr so that I don't end up with an infinite recursion, but also so that it doesn't just stop when it sees the name and ignore the arguments?

Is Python code too complicated for PyParsing to handle? If not, is there any limit to what PyParsing can handle?

(Note - I've included the general tags , , and , because I suspect this is a general recursive grammar definition problem, not something necessarily specific to .)

1
My guess is that your code is failing to consume terminals and so will parse the same token over and over leading to infinite recursion. I don't know pyparsing, but this sample of a Python parser on the PyParsing home page does consume parsed terminals.msw
Parsing of arbitrary ("more than just Python") source code is actually hard; you have to deal with many, many issues, the least of which is what atoms compose your language. Once you "parse" something, then you will not have enough to do much useful in practice. See my bio on "Life After Parsing" (which talks about the parsing issues, too).Ira Baxter
@IraBaxter - Your Life after Parsing assumes I want to do more complicated things than I do. I'm looking to write tools to go back and forth between ASTs and Strings - nothing more. I don't want to do any sort of analysis of it. Further, I'll need to distribute everything in source form for free - I'm not buying your product no matter how perfectly it does what I need.ArtOfWarfare
I'm not asking you to buy my product. I'm making the point to doing sophisticated tasks with code requires more "ASTs and strings". (There being little point in only going back and forth between ASTs and strings, what exactly are you doing that requires nothing more than that, for multiple languages?). You are welcome to achieve that "more sophistication" any way you like including building it all from scratch, or contributing to some other open source solution. I will suggest this is a lot harder than you seem to think.Ira Baxter
Even if you stick to just "ASTs and strings", the problem is still very hard for modern languages and it isn't getting easier. See for example,stackoverflow.com/questions/243383/… And I would be suprrised if your intentions are limited to just those languages that are "easy" to parse.Ira Baxter

1 Answers

1
votes

Your grammar is left recursive: expr expects a call which expects an expr which expects a call... If PyParsing can't handle left recursion, you need to change the grammar to something that PyParsing can work with.

One approach to remove direct left recursion is to change a gramar rule such us:

A = A b | c

into

A = c b*

In your case, left recursion is indirect: it doesn't happen in expr, but in a sub rule (call):

E = C | s | n
C = E x y z

To remove indirect left recursion you usually "lift" the definition of the sub-rule to the main rule. Unfortunatelly this removes the offending sub rule from the grammar -- in other words, you lose some structural expressiveness when you do that.

The previous example, with indirect recursion removed, would look like this:

E = E x y z | s | n

At this point, you have direct left recursion, which is easier to transform. When you deal with that, the result would be something like this -- in pseudo EBNF:

E = (s | n) (x y z)*

In your case, the definition of Expr would become:

Expr = (string | name) Args*
Args = "(" ExprList? ")"
ExprList = Expr ("," Expr)*