I'm building a Definite Clause Grammar to parse 20,000 pieces of semi-natural text. As the size of my database of predicates grows (now up to 1,200 rules), parsing a string can take quite a long time -- particularly for strings that are not currently interpretable by the DCG, due to syntax I haven't yet encoded. The current worst-case is 3 minutes for a string containing 30 words. I'm trying to figure out how I can optimize this, or if I should just start researching cloud computing.
I'm using SWI-Prolog, and that provides a "profile" goal, which provides some statistics. I was surprised to find that the simplest rules in my database are taking up the majority of execution time. My corpus contains strings that represent numbers, and I want to capture these in a scalar/3
predicate. These are hogging ~50-60% of total execution time.
At the outset, I had 70 lines in my scalars.pl
, representing the numeric and natural language representations of the numbers in my corpus. Like so:
scalar(scalar(3)) --> ["three"].
scalar(scalar(3)) --> ["3"].
scalar(scalar(4)) --> ["four"].
scalar(scalar(4)) --> ["4"].
...and so on.
Thinking that the length of the file was the problem, I put in a new rule that would automatically parse any numeric representations:
scalar(scalar(X)) --> [Y], { atom_number(Y, X) }.
Thanks to that, I've gone from 70 rules to 31, and helped a bit -- but it wasn't a huge savings. Is there anything more that can be done? My feeling is maybe not, because what could be simpler than a single atom in a list?
These scalars are called in a lot of places throughout the grammar, and I assume that's the root of the issue. Though they're simple rules, they're everywhere, and unavoidably so. A highly general grammar just won't work for my application, and I wouldn't be surprised if I end up with 3,000 rules or more.
I've never built a DCG this large, so I'm not sure how much I can expect in terms of performance. Happy to take any kind of advice on this one: is there some other way of encoding these rules? Should I accept that some parses will take a long time, and figure out how to run parses in parallel?
Thank you in advance!
EDIT: I was asked to provide a reproducible example, but to do that I'd have to link SO to the entire project, since this is an issue of scale. Here's a toy version of what I'm doing for the sake of completeness. Just imagine there were large files describing hundreds of nouns, hundreds of verbs, and hundreds of syntactic structures.
sent(sent(VP, NP)) --> vp(VP), np(NP).
vp(vp(V)) --> v(V).
np(np(Qty, Noun)) --> qty(Qty), n(Noun).
scalar(scalar(3)) --> ["three"].
scalar(scalar(X)) --> [Y], { atom_number(Y, X) }.
qty(qty(Scalar)) --> scalar(Scalar).
v(v(eat)) --> ["eat"].
n(n(pie)) --> ["pie"].