4 votes

I'm building a Definite Clause Grammar to parse 20,000 pieces of semi-natural text. As the size of my database of predicates grows (now up to 1,200 rules), parsing a string can take quite a long time -- particularly for strings that are not currently interpretable by the DCG, due to syntax I haven't yet encoded. The current worst-case is 3 minutes for a string containing 30 words. I'm trying to figure out how I can optimize this, or if I should just start researching cloud computing.

I'm using SWI-Prolog, which offers a "profile" goal that reports some statistics. I was surprised to find that the simplest rules in my database are taking up the majority of execution time. My corpus contains strings that represent numbers, and I want to capture these in a scalar/3 predicate. These rules are hogging ~50-60% of total execution time.
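For reference, here's roughly how I've been invoking the profiler (sent//1 below stands in for my real top-level nonterminal; a toy version of the grammar appears at the end of this post):

?- profile(phrase(sent(Tree), ["eat", "three", "pie"])).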

At the outset, I had 70 lines in my scalars.pl, representing the numeric and natural language representations of the numbers in my corpus. Like so:

scalar(scalar(3)) --> ["three"].
scalar(scalar(3)) --> ["3"].
scalar(scalar(4)) --> ["four"].
scalar(scalar(4)) --> ["4"].

...and so on.

Thinking that the length of the file was the problem, I put in a new rule that would automatically parse any numeric representations:

scalar(scalar(X)) --> [Y], { atom_number(Y, X) }.

Thanks to that, I've gone from 70 rules to 31, which helped a bit -- but it wasn't a huge savings. Is there anything more that can be done? My feeling is maybe not, because what could be simpler than a single atom in a list?
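For what it's worth, quick checks at the top level behave as I'd expect (my tokenizer hands the DCG SWI-Prolog strings, which atom_number/2 appears to accept as text):

?- phrase(scalar(S), ["3"]).
S = scalar(3).

?- phrase(scalar(S), ["three"]).
S = scalar(3).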

These scalars are called in a lot of places throughout the grammar, and I assume that's the root of the issue. Though they're simple rules, they're everywhere, and unavoidably so. A highly general grammar just won't work for my application, and I wouldn't be surprised if I end up with 3,000 rules or more.

I've never built a DCG this large, so I'm not sure how much I can expect in terms of performance. Happy to take any kind of advice on this one: is there some other way of encoding these rules? Should I accept that some parses will take a long time, and figure out how to run parses in parallel?

Thank you in advance!

EDIT: I was asked to provide a reproducible example, but to do that I'd have to link SO to the entire project, since this is an issue of scale. Here's a toy version of what I'm doing for the sake of completeness. Just imagine there were large files describing hundreds of nouns, hundreds of verbs, and hundreds of syntactic structures.

sent(sent(VP, NP)) --> vp(VP), np(NP).
vp(vp(V)) --> v(V).
np(np(Qty, Noun)) --> qty(Qty), n(Noun).
scalar(scalar(3)) --> ["three"].
scalar(scalar(X)) --> [Y], { atom_number(Y, X) }.

qty(qty(Scalar)) --> scalar(Scalar).
v(v(eat)) --> ["eat"].
n(n(pie)) --> ["pie"].
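
And a query against the toy grammar, just to show the shape of parse tree I'm after:

?- phrase(sent(T), ["eat", "three", "pie"]).
T = sent(vp(v(eat)), np(qty(scalar(3)), n(pie))).
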
Can you give a reproducible example? Like, the full code you run and the data you run it on? In principle, better indexing could help. The rule you have at the moment does indeed look slightly suspicious, but I wouldn't know what to measure. – User9213
What kind of lists are you parsing? Lists of strings or atoms? – false
Two responses before I even woke up, thank you both! User9213: I'll definitely be open-sourcing this project down the line, but at the moment the house is pretty messy and not quite ready for guests, so to speak. May I send you something when it does reach a better state? For now, indexing seems like a promising lead, thank you. false: Sorry, I was loose with my terminology -- it's always lists of strings at the root level; I'm feeding in sentences that ["have", "been", "tokenized", "like", "this", ":"]. These get converted into trees that contain atoms, like: vp(v(tokenize)). – Erik G
Hi, open sourcing was not what I meant. Just give us, in the question, enough code and some example data so that we can run the code, then hopefully improve it, then compare to the original. – User9213
No, it can be much better than what you are doing at the moment. I am currently writing the code for an answer. You should probably show the 30-word string that causes your predicate to take 3 minutes. – User9213

2 Answers

5 votes

One aspect of your program worth investigating is making sure that individual predicates succeed quickly and fail quickly. This is particularly worth checking for predicates that have many clauses.

For instance, when scalar(X) is evaluated on a token that is not a scalar, the program will have to try all 31 clauses (by your last count) before it can determine that scalar//1 fails. If the structure of your program is such that scalar(X) is checked against every token, this could be very expensive.

Further, if scalar(X) does happen to find that a token matches but a subsequent goal fails, then it appears that your program will retry scalar(X) on backtracking until all of the scalar//1 clauses have been attempted.

The judicious use of cut (!) or if-then-else (C1->G1;C2->G2;G3) can provide a tremendous performance improvement. Or you can structure your predicates so that they rely on indexing to select the appropriate clause. E.g.:

scalar(scalar(N)) --> [Token], {scalar1(Token, scalar(N))}.

scalar1("3", scalar(3)) :- !.
scalar1(Y, scalar(X)) :- atom_number(Y, X).

This uses both cut and clause indexing (if the compiler provides it) with the scalar1/2 predicate.
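As a rough check of the intent (using the sketch above, and assuming string tokens as in the question), the refactored nonterminal now commits or fails after a single scalar1/2 lookup:

?- phrase(scalar(S), ["3"]).
S = scalar(3).

?- phrase(scalar(S), ["pie"]).
false.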

EDIT: You should read R. A. O'Keefe's The Craft of Prolog. It is an excellent guide to the practical aspects of Prolog.

0 votes

Here's how I've tackled performance and optimization problems as a novice Prologer.

1.) Introduce timeouts to your application. I'm calling Prolog via the subprocess module in Python 3.6, and that allows you to set a timeout. As I've worked with my code base more, I've got a pretty good sense of how long a successful parse might take, and can assume anything taking longer is not going to work.
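
If you'd rather enforce the limit inside Prolog instead of from the calling process, SWI-Prolog's library(time) provides call_with_time_limit/2. Here's a minimal sketch -- the 10-second budget and the parse_with_budget/2 wrapper name are placeholders, and sent//1 stands in for your top-level nonterminal:

:- use_module(library(time)).

% Fail (rather than hang) if a parse exceeds a fixed time budget.
parse_with_budget(Tokens, Tree) :-
    catch(call_with_time_limit(10, phrase(sent(Tree), Tokens)),
          time_limit_exceeded,
          fail).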

2.) Make use of the graphical profiler that's packaged with the SWI-Prolog IDE. This gives a lot more insight, as you can bounce around the call tree. I found it particularly helpful to sort predicates by the execution time of their children. Before, I was thinking about it like pollution in a river. "Man, there's a lot of junk floating in here," I thought, not considering that upstream some factories were contributing a lot of that junk.

As for how to optimize a DCG without hurting the semantics & expressivity of one's grammar, I think that will have to be a question for another Stack Overflow. And as for my initial question, that's still an open one -- predicates that seem simple (to me) take quite a while.