2
votes

I'm building an ANTLR parser for a small query language. The query language is by definition ambiguous, and we need all possible interpretations (ASTs) to process the query.

Example:

    query   : CLASSIFIED_TOKEN UNCLASSIFIED_TOKEN 
            | ANY_TOKEN UNCLASSIFIED_TOKEN 
            ;

In this case, if input matches both rules, I need to get 2 ASTs with both interpretations. ANTLR will return the first matched AST.

Do you know a simple way to get all possible ASTs for the same grammar? I'm thinking about running parser multiple times, "turning off" already matched rules between iterations; this seems dirty. Is there a better idea? Maybe other lex/parser tool with java support that can do this?

Thanks

2
See my remark about combinatorics on another answer.Ira Baxter

2 Answers

2
votes

If I were you, I'd remove the ambiguities. You can often do that by using contextual information to determine which grammar rules actually trigger. For instance, in

C* X;

in C (not your language, but this is just to make a point), you can't tell if this is just a pointless multiplication (legal to write in C), or a declaration of a variable X of type "pointer to C". So, there are two valid (ambiguous) parses. But if you know that C is a type declaration (from some context, perhaps an earlier code declaration), you can hack the parser to kill off the inappropriate choices and end up with just the one "correct" parse, no ambiguities.

If you really don't have the context, then you likely need a GLR parser, which happily generate both parses in your final tree. I don't know of any available for Java.

Our DMS Software Reengineering Toolkit [not a Java-based product] has GLR parsing support, and we use that all the time to parse difficult languages with ambiguities. The way we handle the C example above is to produce both parses, because the GLR parser is happy to do this, and then if we have additional information (such as symbol table table), post-process the tree to remove the inappropriate parses.

DMS is designed to support the customized analysis and transformation of arbitrary languages, such as your query language, and makes it easy to define the grammar. Once you have a context-free grammar (ambiguities or not), DMS can parse code and you can decide what to do later.

1
votes

I doubt you're going to get ANTLR to return multiple parse trees without wholesale rewriting of the code.

I believe you're going to have to partition the ambiguities, each into its own unambiguous grammar and run the parse multiple times. If the total number of ambiguous productions is large you could have an unmanageable set of distinct grammars. For example, for three binary ambiguities (two choices) you'll end up with 8 distinct grammars, though there might be slightly fewer if one ambiguous branch eliminates one or more of the other ambiguities.

Good luck